[HN Gopher] The Magical Mystery Merge Or Why we run FreeBSD-curr...
___________________________________________________________________
The Magical Mystery Merge Or Why we run FreeBSD-current at Netflix
(2023) [pdf]
Author : ksec
Score : 206 points
Date : 2024-06-10 06:10 UTC (16 hours ago)
(HTM) web link (people.freebsd.org)
(TXT) w3m dump (people.freebsd.org)
| ksec wrote:
| I posted the Serving Netflix Video Traffic at 800Gb/s and Beyond
| [1] in 2022. For those unaware of the context, you may want
| to read the previous PDF and thread. Now we have an update;
| quoting:
|
| > Important Performance Milestones:
|
| 2022: First 800Gb/s CDN server: 2x AMD 7713, NIC kTLS offload
|
| 2023: First 100Gb/s CDN server consuming only 100W of power,
| Nvidia Bluefield-3, NIC kTLS offload
|
| My immediate question is whether the 2x AMD 7713 actually
| consumes more than 800W of power, i.e. more watts/Gbps. Even if
| it does, it is based on 7nm Zen 3 and DDR4 and came out in 2021.
| Would a Zen 5 with DDR5 outperform Bluefield in watts/Gbps?
|
| [1] https://news.ycombinator.com/item?id=32519881
| vitus wrote:
| Note that your power consumption is more than just the CPU
| (combined TDP of 2x225W [0]). You also have to consider SSD
| (16x20W when fully loaded [1]), NIC (4x24W [2]), and the rest
| of the actual system itself (e.g. cooling, backplane).
|
| [0]
| https://www.amd.com/en/products/processors/server/epyc/7003-...
|
| [1] I couldn't find 14TB enterprise SSDs on Intel's website, so
| I'm using the numbers from 6.4TB drives:
| https://ark.intel.com/content/www/us/en/ark/products/202708/...
|
| [2] I'm not sure offhand which model number to use, but both
| models that support 200GbE on pages 93-96 have this maximum
| wattage: https://docs.nvidia.com/nvidia-connectx-6-dx-ethernet-
| adapte...
|
| Or, you can skip all the hand calculations and just fiddle with
| Dell's website to put together an order for a rack while trying
| to mirror the specs as closely as possible (I only included two
| NICs, since it complained that the configuration didn't have
| enough low-profile PCIe slots for four):
|
| https://www.dell.com/en-us/shop/dell-poweredge-servers/power...
|
| In this case, I selected a 1100W power supply and it's still
| yelling at me that it's not enough; 1400W is enough to make
| that nag go away.
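A quick sanity check of the arithmetic above (all component counts and wattages are the estimates from this comment, not official Netflix figures):

```python
# Back-of-the-envelope power budget for the 800Gb/s server, using
# the per-component estimates cited above (TDP/max ratings, so an
# upper-bound-ish subtotal, not measured draw).
components = {
    "CPU (2x AMD EPYC 7713 @ 225W TDP)": 2 * 225,
    "SSD (16 drives @ ~20W under load)": 16 * 20,
    "NIC (4x ConnectX-6 Dx @ ~24W)": 4 * 24,
}
subtotal = sum(components.values())
for part, watts in components.items():
    print(f"{part}: {watts}W")
print(f"Subtotal before cooling/backplane/PSU losses: {subtotal}W")
```

That ~866W subtotal is already close to 1100W once PSU inefficiency and the rest of the chassis are added, which is consistent with Dell's nag about the 1100W supply.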
| throwbigdata wrote:
| That's not how TDP works.
| vitus wrote:
| You're not wrong, but it's still a nominally usable lower
| bound for the actual power draw of the chip, and a
| reasonable proxy for how much heat you need to dissipate
| via your cooling solution.
| ksec wrote:
| Well, I am assuming memory and SSD are the same. The only
| difference should be CPU + NIC, since Bluefield itself is the
| NIC. Maybe Drewg123 could expand on that (if he is allowed
| to).
| vitus wrote:
| That is a fair point, as the 2x CPU + 4x NIC are "only"
| about 550W put together. There's probably more overhead for
| cooling (as much as 40% of the datacenter's power --
| multiplying by 1.5x pushes you just over that 800W number).
|
| That said, being able to do 800G in an 800W footprint
| doesn't automatically mean that you can drive 100G in a
| 100W footprint. Not every ISP needs that 800G footprint, so
| being able to deploy smaller nodes can be an advantage.
|
| Also: I was assuming that 100W was the whole package (which
| is super impressive if so), since the Netflix serving model
| should have most of the SSDs in standby most of the time,
| and so you're allowed to cheat a little bit in terms of
| actual power draw vs max rating of the system.
| walteweiss wrote:
| Is it part of some talk? Is it available online?
| phoronixrly wrote:
| Here it is: https://www.youtube.com/watch?v=q4TZxj-Dq7s
| jsnell wrote:
| > Had we moved between -stable branches, bisecting 3+ years of
| changes could have taken weeks
|
| Would it really? Going by their number of 4 hours per bisect
| step, you get 6 bisections per day, which cuts the range to
| 1/64th of the original. The difference between "three years of
| changes" and "three weeks of changes" is a factor of 50x. I.e.
| within one day, they'd already have identified the problematic
| three week range. After that, the remaining bisection takes
| exactly as long as this bisection did.
|
| Even if they're limited to doing the bisections just during
| working hours in one timezone for some reason, you'd still get
| those six bisection steps done in just three days. It still
| would not add weeks.
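The arithmetic, spelled out (the 4 hours/step and the 3-year vs 3-week ranges are from the thread; the commits-per-week figure is a rough assumption):

```python
import math

STEP_HOURS = 4                    # ramp + re-image time per bisect step
steps_per_day = 24 // STEP_HOURS  # 6 steps/day if fully automated

# One day of bisection shrinks the candidate range by 2**6 = 64x,
# so ~156 weeks (3 years) of commits narrows to ~2.4 weeks.
weeks = 156
print(f"after one day: {weeks / 2**steps_per_day:.1f} weeks left")

# Total steps to isolate a single commit, assuming ~170 commits/week
# (about 500 commits per 3 weeks, per the thread):
total_steps = math.ceil(math.log2(weeks * 170))
print(f"~{total_steps} steps, ~{total_steps / steps_per_day:.1f} days total")
```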
| iainmerrick wrote:
| That's a very good point, but merging in three years of changes
| would also have pulled in a lot of minor performance changes,
| and possibly some incompatible changes that would require some
| code updates. That would slow down each bisection step, and
| also make it harder to pinpoint the problem.
|
| If you know that some small set of recent changes caused a 7%
| regression, you can be fairly confident there's a single cause.
| If 3 years of updates cause a 6% or 8% regression (say), it's
| not obvious that there's going to be a single cause. Even if
| you find a commit that looks bad, it might already have been
| addressed in a later commit.
|
| _Edit to clarify:_ you 're technically correct (the best kind
| of correct!) but I'd still much prefer to merge 3 weeks rather
| than 3 years, even though their justification isn't quite
| right.
| wccrawford wrote:
| I take your point, but that assumes someone working 24 hour
| days, or constantly handing off the project to at least 2 other
| people _every day_.
|
| I don't think those are reasonable work scenarios, so it's more
| like 2 bisects (maybe 3!) per day, rather than 6.
| jsnell wrote:
| Not really, it just assumes that the bisection process is
| automated.
|
| But also, I addressed exactly this objection in the second
| paragraph :)
| jonhohle wrote:
| If a bisection takes a day, it would probably take longer
| to automate than just find it manually. For performance
| bugs, you may need to look at non-standard metrics or
| profiling that would otherwise be a one-off and don't
| necessarily make sense to automate.
| jsnell wrote:
| The full bisection taking just a day doesn't seem
| compatible with the parameters of the story.
|
| Three weeks of FreeBSD changes seems to be about 500
| commits. That's about 9 bisection steps. At two steps /
| day (the best you can do without automation), that's a
| full work week. It seems obvious that this is worth
| automating.
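A toy demonstration of what that automation looks like with `git bisect run`: a throwaway 16-commit repo stands in for the real tree, and a pass/fail command stands in for the performance test (Netflix's actual step also involves re-imaging and ramping a server).

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "bisect@example.com", cwd=repo)
git("config", "user.name", "bisect", cwd=repo)

# 16 commits; commit 11 silently flips perf.txt from "fast" to
# "slow", standing in for the regression-introducing change.
for i in range(1, 17):
    with open(os.path.join(repo, "perf.txt"), "w") as f:
        f.write("slow" if i >= 11 else "fast")
    git("add", "perf.txt", cwd=repo)
    git("commit", "-qm", f"commit {i}", cwd=repo)

# bad = HEAD, good = root commit; the run command must exit 0 on a
# good revision and non-zero (except 125) on a bad one.
root = git("rev-list", "--max-parents=0", "HEAD", cwd=repo).strip()
git("bisect", "start", "HEAD", root, cwd=repo)
git("bisect", "run", "grep", "-q", "fast", "perf.txt", cwd=repo)
first_bad = git("show", "-s", "--format=%s",
                "refs/bisect/bad", cwd=repo).strip()
git("bisect", "reset", cwd=repo)
print("first bad:", first_bad)
```

git drives the binary search itself; the operator's only job is supplying a reliable good/bad test, which is exactly the part the 4-hour ramp/re-image cycle dominates.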
| yencabulator wrote:
| Bisection and testing might be automated, but resolving
| merge conflicts isn't.
| drewg123 wrote:
| Author here: Note that the 4 hours per bisection is the time to
| ramp the server up and down and re-image it. It does not count
| the time to actually do the bisection step. That's because in
| our case of following head, the merging & conflicts were
| trivial for each step of the 3 week bisection. Eg, the
| bisections we're doing are far simpler than the bisections we'd
| be doing if we didn't track the main branch, and had a far
| larger patchset that had to be re-merged at each bisection
| point. I'm cringing just imagining the conflicts.
| crote wrote:
| Does "the server" imply you're only using a single server for
| this? I would have expected that at Netflix's scale it
| wouldn't be _too_ difficult to do the whole bisect-and-test
| on a couple dozen servers in parallel.
| generalizations wrote:
| Wouldn't each bisect depend on the ones before it? So you
| can't ramp up the next test before you finish the prior.
| toast0 wrote:
| You could speculate in both directions, and test three
| revisions in parallel... However, if there's a lot of
| manual merging to do, you might not want to do the extra
| merge that's involved there. You might also get nerd
| sniped into figuring out if there's a better pattern if
| you're going to test multiple revisions in parallel.
| zellyn wrote:
| lol, came here to say this, armed with identical log2 53 == 5.7
| :-) The replies to your comment are of course, spot on, though.
| Finding an 8% performance regression in three years of commits
| could have taken a looooong time.
| hnarayanan wrote:
| Going through these slides was like reading a thriller!
| p0seidon wrote:
| I agree, but: the best thing about this is that the work is
| actually done by a small number of people instead of an
| overengineered system with a custom solution. The thriller is
| great, but the efficiency powering all of Netflix's CDN is
| also refreshing.
| gigatexal wrote:
| Have they ever talked about why the content isn't stored on/read
| from ZFS, just the root pool?
| iv42 wrote:
| Yes. Because sendfile(2) is not zero-copy on ZFS.
| shrubble wrote:
| Yes, it is because they can do zero-copy sending of content
| that they can't do under ZFS (yet). Some links to Netflix
| papers and video talks on this older Reddit thread:
| https://www.reddit.com/r/freebsd/comments/ltjv8m/zfs_is_over...
| gigatexal wrote:
| bravo thank you for that
| 8fingerlouie wrote:
| Besides not being able to do zero-copy on ZFS (yet), it
| probably also has to do with them not using RAID for content,
| and single-drive ZFS doesn't make much sense in that scenario.
|
| Single drive ZFS brings snapshots and COW, as well as bitrot
| detection, but for Netflix OCAs, snapshots are not used, and
| it's mostly read-only content, so not much use for COW, and
| bitrot is less of a problem with media. Yes, you may get a
| corrupt pixel every now and then (assuming it's not caught by
| the codec), but the media will be reseeded again every n days,
| so the problem will solve itself.
|
| I assume they have ample redundancy in the rest of the CDN, so
| if a drive fails, they can simply redirect traffic to the next
| nearest mirror, and when a drive is eventually replaced, it
| will be seeded again by normal content seeding.
| _zoltan_ wrote:
| yet? you write "yet" as if it's something that would be almost
| readily available, yet it's been at least 2 years now and
| it's still not there.
|
| am I missing something, or is that "yet" more like "maybe
| sometime, if ever"?
| toast0 wrote:
| Netflix has been developing in kernel and userland for
| this; if ZFS for content was a priority, they could make
| 0-copy sendfile work. Yes, it's not trivial at all. It will
| probably happen eventually, by someone who needs 0-copy and
| ZFS; or by ZFS people who want to support more use cases.
| sophacles wrote:
| Yet does not suggest any sort of impending completion.
|
| We haven't been to the center of the galaxy yet.
|
| We haven't achieved immortality yet.
|
| Both are valid sentences without any fixed timeline, and in
| the case of the first, a date that is hundreds of thousands
| of years in the future at soonest.
|
| "yet" just means "up until this time" (in context, sometimes
| it means "by the time you're talking about" - e.g. I'm
| scheduled on-call on the 13th, but i won't be back from my
| PTO yet).
| drewg123 wrote:
| In addition to the lack of zero-copy sendfile from ZFS, we also
| have the problem that ZFS is lacking async sendfile support
| (e.g., the ability to continue the sendfile operation from the
| disk interrupt handler when the blocks arrive from disk).
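For context, sendfile(2) is the syscall under discussion: the kernel pushes file pages straight to the socket with no userspace copy. A minimal, illustrative serving loop via Python's `os.sendfile` wrapper (Netflix's real path adds kTLS offload, the async disk-interrupt continuation described above, and much more):

```python
import os
import socket

def serve_file(conn: socket.socket, path: str) -> int:
    """Send the whole file over the socket without copying the
    data through userspace; the kernel moves the pages directly.
    Returns the number of bytes sent."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset,
                               size - offset)
            if sent == 0:  # peer closed or nothing left to send
                break
            offset += sent
    return offset
```

On FreeBSD with UFS this path can be fully zero-copy; per the thread, on ZFS the data is copied and async continuation isn't available, which is why the content drives don't use ZFS.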
| phoronixrly wrote:
| Video from the talk at OpenFest 2023:
| https://www.youtube.com/watch?v=q4TZxj-Dq7s
| eatonphil wrote:
| Competition is good (not everyone using Linux, I mean), and I've
| run FreeBSD on my desktop and server for a few years.
|
| But whenever Netflix's use of FreeBSD comes up, I never come away
| with a concrete understanding of: is the cost/performance
| Netflix gets with FreeBSD really not doable on Linux?
|
| I'm trying to understand if it's inertia, or if not, why more
| cloud companies with similar traffic (all the CDN companies for
| example) aren't also using FreeBSD somewhere in their stack.
|
| If it were just the case that they like FreeBSD that would be
| fine and I wouldn't ask more. But they mention in the slides
| FreeBSD is a performance-focused OS which seems to beg the
| question.
| Thaxll wrote:
| The truth is that some senior engineer at Netflix chose to use
| FreeBSD and they've stuck with that choice since then. FreeBSD is
| not better; it just happens to be the solution they chose.
|
| All the benefits they added to FreeBSD could be added the same
| way in Linux if they were missing.
|
| YouTube / Google CDN is much bigger than Netflix and runs 100%
| on Linux; you can make pretty much everything work on modern
| solutions / languages / frameworks.
| inopinatus wrote:
| Sorry, this is seriously misinformed:
|
| > YouTube / Google CDN is much bigger than Netflix
|
| Youtube and Netflix are on par. According to Sandvine,
| Netflix sneaked past Youtube in volume in 2023[1]. I believe
| their 2024 report shows them still neck-and-neck.
|
| > you can make pretty much everything work on modern solution
|
| Presenting a false equivalence without evidence is not
| convincing. "You could write it in SNOBOL and deploy it on
| TempleOS". Netflix didn't choose something arbitrary or by
| mistake. They chose one of the world's highest performing and
| most robustly tested infrastructure operating systems. It's
| the reason a FreeBSD derivative lies at the core of Juniper
| routers, Isilon and Netapp storage heads, every Playstation
| 3/4/5, and got mashed into NeXTSTEP to spawn Darwin and
| thence macOS, iOS etc.
|
| It continues to surprise me how folks in the tech sector
| routinely fail to notice how much BSD is deployed in the
| infrastructure around them.
|
| > All the benefits they added to FreeBSD could be added the
| same way in Linux
|
| They cannot. Case in point, the bisect analysis described in
| the presentation above doesn't exist for Linux, where
| userland distributions develop independently from the kernel.
| Netflix is touting here the value of FreeBSD's unified
| release, since the bisect process fundamentally relies on a
| single dimension of change (please ignore the mathematicians
| muttering about Schur's transform).
|
| [1] https://www.sandvine.com/press-
| releases/sandvines-2023-globa...
| davidw wrote:
| > There's a reason it's the core of Juniper routers, Isilon
| and Netapp storage heads, every Playstation 3/4/5, and got
| mashed into NeXTSTEP to spawn macOS.
|
| Licensing?
|
| FreeBSD is a fine OS that surely has some advantages here
| and there, but I'm also inclined to think that big
| companies can make stuff work if they want to.
|
| PHP at Meta seems like a pretty good example of this.
| inopinatus wrote:
| The permissive, freewheeling nature of the BSD license is
| touted by some as an advantage for infrastructure use but
| in practice, Linux-based devices and services aren't
| particularly encumbered by the GPL, so to me it's a wash.
| davidw wrote:
| Could be lawyers at some companies just didn't want
| anything to do with the GPL, especially if they're
| fiddling with the kernel itself. Maybe they're not even
| correct about the details of it, just fearful. "Lawyers
| overly cautious" is not an uncommon thing to see.
| bluGill wrote:
| Having seen the screams from some people when you use the GPL
| legally, I can't blame the lawyers. You might be correct, but
| you will still get annoying people yelling that you are not.
| There is also the cost of verifying you are correct. (We put
| the wrong address in for where to send your request for
| source; fortunately a tester caught it before release, but it
| was still expensive, as the lawyers forced the full recall
| process, meaning we had to send a tech to upgrade the testers
| instead of saying where to download the fix.)
| throw0101d wrote:
| > [...] _Linux-based devices and services aren't particularly
| encumbered by the GPL, so to me it's a wash._
|
| Linux uses GPL 2.x. If we are talking about GPL 3.x
| things may be different:
|
| > _One major danger that GPLv3 will block is tivoization.
| Tivoization means certain "appliances" (which have computers
| inside) contain GPL-covered software that you can't
| effectively change, because the appliance shuts down if it
| detects modified software._
|
| * https://www.gnu.org/licenses/rms-why-gplv3.en.html
|
| If you want to ship (embedded) GPL 2 Linux-based devices
| things are fine, but if there's software that is GPL 3
| then that can cause issues. Linus on GPLv3:
|
| * https://www.youtube.com/watch?v=PaKIZ7gJlRU
| cheema33 wrote:
| > Licensing?
|
| I think you've hit the nail on the head. I used to work
| for a very very large company. And they had a very strong
| preference for BSD licensed software, if we were building
| something using outside software. A few years ago, Steve
| Ballmer and others spent a lot of time spreading FUD by
| calling Linux and the GPL a "cancer". Believe it or not, that
| stuff had a massive impact. Not on small shops, but on large
| company lawyers.
|
| Over the years, Steve left Microsoft and Microsoft has
| become a lot more Linux friendly. And the cancer fear has
| subsided. But it was very very real.
|
| On a side note, if I recall correctly, Steve Jobs wanted
| to use Linux for MacOS. But he had licensing concerns and
| called up Linus and asked him to change a few things.
| Linus gave him the middle finger and that is how we got
| MacOS with BSD.
| SoftTalker wrote:
| I doubt this story about Linus and Steve. MacOS is
| NEXTSTEP-derived and that pre-dated Linux by several
| years.
| kelsey98765431 wrote:
| Fun fact! NeXT was a commercial AT&T Unix fork. The
| transition from the Unix base to the BSD base did in fact
| happen at Apple after the acquisition. The value of NeXT
| was in its application library, which would eventually
| become the Mac foundation libraries like CoreAudio and
| Cocoa etc. The earliest releases of Rhapsody are very
| illuminating about the architecture of XNU/OSX. I don't
| doubt that Linux was considered. There's a specific time
| when the actual move of Rhapsody to a FreeBSD base
| occurred, and it was at Apple sometime in 97 or 98 IIRC.
| teddyh wrote:
| NeXTSTEP was always BSD4.3-Tahoe Unix, not AT&T Unix.
| inopinatus wrote:
| Be all that as it may, this is still a "why not Linux"
| line of thinking, rather than "why FreeBSD", which is the
| more interesting question. And it is not a binary choice.
| toast0 wrote:
| > Incorrect. Case in point, the bisect analysis described
| in the presentation above doesn't exist for Linux, where
| userland distributions develop independently from the
| kernel.
|
| You can certainly bisect the Linux kernel changes though.
| And the bug in question was a kernel bug. For a project
| like this, IMHO, most of the interesting bugs are going to
| be kernel bugs.
| inopinatus wrote:
| Perhaps so, but this knowledge is _post hoc_ for the
| incident and does not undermine the engineering value of
| the unified release.
| toast0 wrote:
| Probably, if you were doing this on Linux, you'd follow
| someone's Linux tree, and not run Debian unstable with
| updates to everything.
|
| You might end up building a lightweight distribution
| yourself; some simple init, enough bits and bobs for in
| place debugging, and the application and its monitoring
| tools.
|
| Anyway, if you did come across a problem where userland
| and kernel changed and it wasn't clear where the breakage
| is, the process is simple. Test new kernel with old
| userland and old kernel with new userland, and see which
| part broke. Then bisect if you still can't tell by
| inspection.
| forgotpwd16 wrote:
| >It's the reason a FreeBSD derivative
|
| _A_ reason. Licensing also plays a role. Some may say the
| most important one.
|
| >Case in point
|
| Not really. This is an advantage, yes, but it was inherent to
| the BSD development style, not an addition they made. I assume
| GP refers to other presentations, which talk about the
| optimizations needed to get the performance Netflix needs
| from FreeBSD.
|
| >doesn't exist for Linux
|
| But it can be done by putting the kernel and userland in the
| same repo as modules.
| secondcoming wrote:
| 'runs 100% on Linux' is a bit vague. What customisations do
| they do?
| yencabulator wrote:
| Google has ~9000 kernel patches, and a fairly custom
| userspace. Saving 2% at that scale is _huge_.
| hiAndrewQuinn wrote:
| Might be a manpower thing. By hiring a bunch of FreeBSD Core
| devs, Netflix might be able to get a really talented team for
| cheaper than they might get a less ideologically flavored team.
| (I say this as I set up i3 on FreeBSD on my Thinkpad X280, I'm
| a big fan!)
| IshKebab wrote:
| They also get much more control. If they employ most of the
| core FreeBSD devs they basically have their own OS that they
| can do what they like with. If they want some change that
| benefits their use case at the detriment of other people they
| can pretty much just do it.
|
| That's not really possible with Linux.
| guenthert wrote:
| The flip side is of course that in this scenario they would
| have to finance most improvements. In Linux land the cost
| of development is shared across many, which one might
| expect to yield a _generally_ better product.
| toast0 wrote:
| They're at the cutting edge. They're going to have to
| finance the development of most of this stuff anyway.
| yencabulator wrote:
| _This_ development, but not all of the other things that
| go into a kernel and distro. If they pocket the whole
| core team, then they end up paying for all of the work,
| not just the few optimizations they really care about.
| throw0101d wrote:
| > _In Linux land the cost of development is shared across
| many, which one might expect to yield a_ generally
| _better product._
|
| And you have more cooks in the kitchen tweaking things that
| you also want to work on, so there's a higher chance of
| conflicts, plain and simple, but also of folks who want to go
| in a completely different direction
| technically/philosophically.
| kelsey98765431 wrote:
| When netflix was founded the only viable commercial linux
| vendor was rhel and the support contract would have been
| about the same as just hiring the fbsd core team at
| salary.
|
| People really do not remember the state of early linux.
| Raise a hand if you have ever had to compile a kernel
| module from a flash drive to make a laptops wifi work,
| then imagine how bad it was 20 years before you tried and
| possibly failed at getting networking on linux to work,
| let alone behave in a consistent manner.
|
| The development costs were not yet shared back then, most
| of the linux users at the vendor support level were still
| smaller unproven businesses and most importantly if you
| intend to build something that nobody else has you do not
| exactly want to be super tied to a large community that
| can just take your work and directly go clone your
| product.
|
| Hiring foss devs and putting them under NDA for
| everything they write that doesn't get upstreamed is an
| excellent way to get nearly everything upstreamed aswell,
| and the cost of competitors then porting these merged
| upstream changes back down into their linux is not
| nothing, so this gives a competitive moat advantage.
| toast0 wrote:
| > When netflix was founded the only viable commercial
| linux vendor was rhel and the support contract would have
| been about the same as just hiring the fbsd core team at
| salary.
|
| AFAIK, currently, Netflix only uses FreeBSD for their CDN
| appliances; their user facing frontend and apis live (or
| lived) in AWS on Linux. I don't know what they were
| running on before they moved to the cloud.
|
| I don't think they started doing streaming video until
| 2007 and they didn't start deploying CDN nodes until 2011
| [1]. They started off with commercial CDNs for video
| distribution. I don't know what the linux vendor
| marketplace looked like in 2007-2011, but I'm sure it
| wasn't as niche as in 1997 when Netflix was founded. I
| think they may have been using Linux for production at
| the time that they decided to use FreeBSD for CDN
| appliances.
|
| > Hiring foss devs and putting them under NDA for
| everything they write that doesn't get upstreamed is an
| excellent way to get nearly everything upstreamed aswell,
| and the cost of competitors then porting these merged
| upstream changes back down into their linux is not
| nothing, so this gives a competitive moat advantage.
|
| I don't think Netflix is particularly interested in a
| software moat; or they wouldn't be public about what they
| do and how, and they wouldn't spend so much time
| upstreaming their code into FreeBSD. There's an argument
| to be made that upstreaming reduces their total effort,
| but that's less true the less often you merge. Apple
| almost never merges in upstream changes from FreeBSD back
| into mac os; so not bothering to upstream their changes
| saves them a lot of collaborative work at the cost of
| making an every 10 years process a little bit longer.
|
| At WhatsApp, I don't think we ever had more than 10
| active patches to FreeBSD, and they were almost all tiny;
| it wasn't a lot of effort to port those forward when we
| needed to, and certainly less effort than sending and
| following up on getting changes upstream. We did get a
| few things upstreamed though (nothing terribly
| significant IMHO; I can remember some networking things
| like fixing a syncookie bug that got accidentally
| introduced and tweaking the response for icmp needs frag
| to not respond when the requested mtu was at or bigger
| than the current value; nothing groundbreaking like async
| sendfile or kTLS).
|
| [1] https://web.archive.org/web/20121021050251/https://si
| gnup.ne...
| yencabulator wrote:
| Netflix, the online service, launched in 2007. The
| previous business, doing DVD mailing, had no such high
| bandwidth serving requirements. You're exaggerating the
| timeline quite a lot.
| toast0 wrote:
| It's hard to do an apples-to-apples comparison, because you'd
| need two excellent, committed teams working on this.
|
| I'm a FreeBSD supporter, but I'm sure you _could_ get things to
| work in Linux too. I haven't seen any posts like 'yeah, we got
| 800 gbps out of our Linux CDN boxes too', but I don't see a lot
| of posts about CDN boxes at all.
|
| As someone else downthread wrote, using FreeBSD gives a lot of
| control, and IMHO provides a much more stable base to work
| with.
|
| Where I've worked, we didn't follow -CURRENT, and tended to
| avoid .0 releases, but it was rare to see breakage across
| upgrades, and it was typically easy to track down what changed
| because there usually weren't a lot of changes in the suspect
| system. That's not really been my experience with Linux.
|
| A small community probably helps get their changes upstreamed
| regularly too.
| bluedino wrote:
| Are there papers out there from other companies that detail
| what performance levels have been achieved using Linux in a
| similar device to the Netflix OCA? Maybe they just use two
| devices that have 90% of the performance?
| bbatha wrote:
| My understanding was that Netflix liked FreeBSD for a few
| reasons, some of them more historical than others.
|
| * The networking stack was faster at the time
|
| * dtrace
|
| * async sendfile(2) https://lists.freebsd.org/pipermail/svn-
| src-head/2016-Januar...
|
| Could they have contributed async sendfile(2) to Linux as well?
| Probably. In 2024 these advantages seem to be moot: ebpf,
| io_uring, more maturity in the Linux network stack, plus FreeBSD
| losing more and more vendor support by the day.
| TheCondor wrote:
| The community is such that if one of either FreeBSD or Linux
| really outperformed the other in some capacity, they'd rally
| and get the gap closed pretty quickly. There are benchmarks
| that flatter both but they tend to converge and be pretty
| close.
|
| Netflix has a team of, I think it was 10 from the PDF,
| engineers working on this platform. A substantial number of
| them are FreeBSD contributors with relationships to it. It's a
| very special team. That's the difference maker here. If it was
| a team of former Red Hat guys, I'm sure they'd be on Linux. If
| it was a team of old Solaris guys, I wouldn't be surprised if
| they were on one of the Solaris offsprings. Then, Netflix knew
| that at their scale and to make it work, they had to build this
| stuff out. That was something they figured out pretty quickly,
| they found the right team and then built around them. It's far
| more sophisticated than "we replaced RedHat Enterprise with
| FreeBSD and reaped immediate performance advantages." At their
| scale, I don't think there is an off the shelf solution.
| kelsey98765431 wrote:
| I think you're close but missing an element. FreeBSD is a
| centrally designed and implemented OS from kernel to libc to
| userland. The entire OS is in the source tree, with updated
| and correct documentation, with coherent design.
|
| Any systems-level team will prefer this kind of tightly
| integrated solution to the systems-layer problem if they are
| responsible for a highly built-out, specialized distributed
| application like Netflix. The reasons for the design choices
| going all the way back to FreeBSD 1 are available on the
| mailing list, in a central place. Everything is there.
|
| Trying to maintain your own linux distro is insanely
| difficult, infact google still uses a 2.2 kernel with all
| custom back ported updates of the last 30 years.
|
| The resources needed to match the relatively small FreeBSD
| design and implementation team are minuscule compared to the
| infinite sprawl of Linuxes; a team of 10 FreeBSD sysprogs is
| basically the same number of people responsible for designing
| and building the entire thing.
|
| It comes down to the sources of truth. In the world of fbsd
| that's the single fbsd repo and the single mailing list. For
| a linux, that could be thousands of sources of truth across
| hundreds of platforms.
| TheCondor wrote:
| I understand the point you're trying to make and I agree
| that FreeBSD tends to have cleaner code and better
| documentation at different levels but I don't think that it
| makes it that much more difficult. If you dropped in from a
| different world and had zero experience then I think you're
| right and a systems team would almost always pick BSD. Any
| actual experience pretty quickly swings it the other way
| though; there are also companies dedicated to helping you
| fill in those gaps on the Linux side.
|
| I've built a couple embedded projects on Linux, when you're
| deep on a hard problem, the mailing lists and various
| people are nice, but the "source of truth" is your logic
| analyzer and you debug the damn thing. Or your hardware
| vendor might have some clues as they know more of the bugs
| in their hardware.
| kelsey98765431 wrote:
| Fair points taken; I was a bit zealous in my use of the
| word "any", the word "many" is more correctly applicable.
|
| In regard to sources of truth, I mean from the design-
| considerations point of view. For instance, why does the
| scheduler behave a certain way? We can analyze the logic
| and determine the truth of the code's behavior, but
| determining the reason for the selection of that design
| for implementation is far more difficult.
|
| These days, yes, off-the-shelf Linux will do just fine at
| massively scaling an application. When Netflix started
| building, Blockbuster was still a huge scary thing to be
| feared and respected. Linux was still a hobby project if
| you didn't fork out 70% of a commercial Unix contract
| over to RHEL.
|
| The team came in with the expectation and understanding
| they would be making massive changes for their
| application that may never be merged upstream. The
| chances of getting a merge in are higher if the upstream
| is a smaller centralized team. You can also just ask the
| person who was in charge of the, let's say for example,
| the init system design and implementation. Or oh, that
| scheduler! Why does it deprioritize x and y over z, is
| there more to why this was chosen than what is on the
| mailing list?
|
| The pros go on and on, and the cons are difficult to
| imagine unless you make a vacuum and compare 2024 Linux
| to 2004 Linux.
| kstrauser wrote:
| > infact google still uses a 2.2 kernel with all custom
| back ported updates of the last 30 years.
|
| Say what?
| mbilker wrote:
| They are trying to get off or have gotten off their
| kernel fork called "Prodkernel" for some time now.
|
| https://lwn.net/Articles/871195/
| https://events.linuxfoundation.org/wp-
| content/uploads/2022/0...
| yencabulator wrote:
| And ProdKernel generally lags mainstream by only a few
| years, as can be seen in the things you linked to.
|
| The change being talked about here is moving from merging
| every ~2 years to merging all the time. Saying they're
| stuck on something from 1999 is ridiculous.
| yencabulator wrote:
| > infact google still uses a 2.2 kernel with all custom
| back ported updates of the last 30 years.
|
| Linux 2.2 is from 1999. It can barely do SMP. That's pretty
| much a crazy person claim.
|
| Googlers say ProdKernel is merged forward every few years:
|
| > Every two years or so, those patches are rebased onto a
| newer kernel version, which provides a number of
| challenges.
|
| https://lwn.net/Articles/871195/
|
| > Every ~2 years we rebase all these patches over a ~2 year
| codebase delta
|
| https://events.linuxfoundation.org/wp-
| content/uploads/2022/0... (and many other places)
| arp242 wrote:
| Practically this just doesn't matter all that much. You can
| prefer one approach to the other and that's all fine, but
| from a "serious engineering" perspective it just doesn't
| really matter.
|
| > Trying to maintain your own linux distro is insanely
| difficult, infact google still uses a 2.2 kernel with all
| custom back ported updates of the last 30 years.
|
| 2.2? Colour me skeptical on that.
|
| But it's really not that hard to make a Linux distro. I
| know, because I did it, and tons of other people have. It's
| little more than "get kernel + fuck about with Linux
| booting madness + bunch of userland stuff". The same
| applies to FreeBSD by the way, and according to the PDF
| "Netflix has an internal "distro"".
|
| The problems Google has is because they maintain extensive
| patchsets, not because they built their own distro. They
| would have the same problems with FreeBSD, or any other
| complex piece of software.
| Thaxll wrote:
| I'm still surprised they did not get away from Nginx at that
| point and did something like Cloudflare.
| alberth wrote:
| When you're pushing the amount of data Netflix is, you need to
| work directly with the ISP & Exchanges.
|
| At least that's my guess.
| yencabulator wrote:
| Parent is talking purely about software.
|
| https://blog.cloudflare.com/how-we-built-pingora-the-
| proxy-t...
| toast0 wrote:
| Do you mean moving away from using Nginx, like Cloudflare moved
| to a custom replacement? [1]
|
| I don't think that's as needed for Netflix. My understanding is
| their CDN nodes are using Nginx to serve static files --- their
| content management system directs off-peak transfers and
| routes clients to nodes that it knows have the files. They don't
| run a traditional caching (reverse) proxy, and they most likely
| don't hit a lot of the things Cloudflare was hitting, because
| their use case is so different.
|
| (I haven't seen a discussion of how Netflix handles live
| events, that's clearly a different process than their
| traditional prerecorded media)
|
| [1] https://blog.cloudflare.com/how-we-built-pingora-the-
| proxy-t...
| drewg123 wrote:
| Yes, exactly.
| Thaxll wrote:
| Yes, I'm talking about Nginx running on those BSD boxes;
| they have such a custom design that writing their own
| static-file server would have made sense.
| toast0 wrote:
| Getting an HTTP server to work with the diversity of HTTP
| clients in the world that Netflix supports is not going to
| be fun, and NGINX is right there.
|
| As I understand it, they've made some changes to NGINX, but
| I don't think they've made a lot, and I don't think any were
| cases where the structure of NGINX was not conducive to the
| change or was limiting.
|
| I'm not one to shy away from building a custom design, but
| it's a lot easier when you control the clients, and Netflix
| has to work with everything from browsers to weird SDKs in
| TVs.
|
| Netflix OCA performance seems mostly bottlenecked on
| I/O/memory bandwidth (and cpu overhead for pacing?) and any
| sensible HTTP server for static files isn't going to use a
| lot of memory bandwidth processing inbound requests and
| calling sendfile on static files. So why spend the limited
| time of a small team building something that's not going to
| make a big difference?
| ay wrote:
| Sometimes it's still hard to tackle the psychology of people
| who are used to the "comfort" of the "sta[b]le" branches.
|
| So at work I came up with the following observation: if you
| are a consumer and are afraid of unpredictable events at the
| head of the master/main branch, then by using the head of
| master/main from 21 days ago you get 3 weeks of completely
| predictable and _modifiable_ future.
|
| Any cherry-picks are made during the build process, so there
| is no branch divergence - if the fix gets upstreamed, it is
| not cherry-picked anymore.
|
| Thus, unlike with stable branches, by default it converges back
| to master.
|
| "But what if the fix is not upstreamed?" - then it stays
| there, and depending on the nature of the code it bears a
| bigger or smaller ongoing cost - which reflects the
| technical debt well, as it is.
|
| This has worked pretty well for the past 4-5 years and is now
| used for quite a few projects.
| bongodongobob wrote:
| This is how OS updates have worked at every company I've been
| at. Either you have a handful of devices that get them
| immediately and you scream-test, or you simply wait 3 weeks
| and then roll them out (minus security CVEs, of course).
| kazinator wrote:
| What are they bisecting with? FreeBSD uses CVS and Perforce.
|
| These items in the slide don't add up:
|
| > _Things had worked accidentally for years, due to linkerset
| alphabetical ordering_
|
| In other words, the real bug is many years old. Yet, on the last
| slide:
|
| > _Since we found the bug within a week or so of it hitting the
| tree, the developer responsible was incredibly responsive. All
| the details were fresh in his memory._
|
| What? Wasn't what hit the tree the ordering change which exposed
| the bug? It doesn't seem like that being fresh in anyone's mind
| is of much help.
| sakjur wrote:
| FreeBSD has moved to Git as its primary VCS, see
| https://lists.freebsd.org/pipermail/freebsd-current/2020-Dec...
| drewg123 wrote:
| FreeBSD hasn't used CVS or P4 in decades. FreeBSD uses git
| internally, and has a github mirror. See
| https://docs.freebsd.org/en/articles/committers-guide/
|
| The unintentional ordering change being fresh in Colin's
| memory was really helpful, as he quickly realized that he'd
| actually changed the ordering and sent me a patch to fix it.
| If it had been 3 years in the past, I suspect he would have
| been less responsive (I know I would have been).
| pronoiac wrote:
| There were two sets of bugs:
|
| * the new sort handled ties differently. They adjusted that,
| and they _could_ have stopped there.
|
| * the other was that loading the correct drivers was
| sensitive to ordering, which the old sort had masked. They
| handled this along with another driver bug, in amdtemp.
|
| If they'd found this years later, even investigating the first
| set would be a slower process - it wouldn't be fresh in minds,
| and it likely would then have other code relying on it, so
| adjusting it would be trickier.
| kazinator wrote:
| Why would you alphabetically order initializations?
|
| Every complex system I've ever worked on that had a large
| number of initializations was sensitive to ordering.
|
| Languages with module support like Wirth's Modula-2 ensure that
| if module A uses B, B's initialization will execute before A's.
| If there is no circular dependency, that order will never be
| _wrong_.
|
| The reverse order could work too, but it's a crapshoot then.
| Module dependency doesn't logically _entail_ initialization-
| order dependency: A's initializations might not require B's
| initializations to have completed.
|
| If you're initializing by explicit calls in a language that
| doesn't automate the dependencies, the baseline safest thing to
| do is to call things in a fixed order that is recorded somewhere
| in the code: array of function addresses, or just an init
| procedure that calls others directly.
|
| If you sort the init calls, it has to be on some property linked
| to dependency order, otherwise don't do it. Unless you've encoded
| something related to dependencies into the module names,
| lexicographic order is not right.
|
| In the firmware application I work on now, all modules have a
| statically allocated signature word that is initially zero and
| set to a pattern when the module is initialized. The external API
| functions all assert that the pattern has the correct value,
| which is strong evidence that the module had been initialized
| before use.
|
| On one occasion I debugged a static array overrun which
| trashed these signatures, causing the affected modules to
| assert.
| toast0 wrote:
| Having a consistent ordering avoids, by construction,
| differences in results from inconsistent ordering. IIUC,
| alpha sort was/is used as a tie-breaker after declared
| dependencies or other ordering information.
|
| In this case, two (or more) modules indicate they can handle
| the same hardware and didn't have information on priority if
| both were present. Probably this should be detected / raise a
| fault, but under the previous regime of alpha sort, it was
| handled nicely because the preferred drivers happened to sort
| first.
| andrewstuart wrote:
| >> Most of us are FreeBSD committers or contributors
|
| This is why you shouldn't copy this strategy.
| arp242 wrote:
| Most of us also aren't running Netflix, or anything of the
| sort. All the big companies that heavily use Linux at scale
| also employ tons of Linux kernel engineers.
| andrewstuart wrote:
| I love FreeBSD, and in a different world it really should
| occupy the place Linux does.
|
| But the reality is it's a Linux world and now Linux has systemd
| which makes switching to anything else not an option for me.
|
| You'd have to pry systemd from my cold, dead hands.
___________________________________________________________________
(page generated 2024-06-10 23:01 UTC)