[HN Gopher] The Magical Mystery Merge Or Why we run FreeBSD-curr...
___________________________________________________________________
The Magical Mystery Merge Or Why we run FreeBSD-current at Netflix
(2023) [pdf]
Author : ksec
Score : 206 points
Date : 2024-06-10 06:10 UTC (16 hours ago)
(HTM) web link (people.freebsd.org)
(TXT) w3m dump (people.freebsd.org)
| ksec wrote:
| I posted the Serving Netflix Video Traffic at 800Gb/s and Beyond
| [1] in 2022. For those unaware of the context, you may want
| to read the previous PDF and thread. Now we have an update;
| quoting:
|
| > Important Performance Milestones:
|
| 2022: First 800Gb/s CDN server: 2x AMD 7713, NIC kTLS offload
|
| 2023: First 100Gb/s CDN server consuming only 100W of power,
| Nvidia Bluefield-3, NIC kTLS offload
|
| My immediate question is whether the 2x AMD 7713 actually
| consumes more than 800W of power, i.e. more watts/Gbps. Even if
| it does, it is based on 7nm Zen 3 and DDR4 and came out in 2021.
| Would a Zen 5 with DDR5 outperform Bluefield in watts/Gbps?
|
| [1] https://news.ycombinator.com/item?id=32519881
| vitus wrote:
| Note that your power consumption is more than just the CPU
| (combined TDP of 2x225W [0]). You also have to consider SSD
| (16x20W when fully loaded [1]), NIC (4x24W [2]), and the rest
| of the actual system itself (e.g. cooling, backplane).
|
| [0]
| https://www.amd.com/en/products/processors/server/epyc/7003-...
|
| [1] I couldn't find 14TB enterprise SSDs on Intel's website, so
| I'm using the numbers from 6.4TB drives:
| https://ark.intel.com/content/www/us/en/ark/products/202708/...
|
| [2] I'm not sure offhand which model number to use, but both
| models that support 200GbE on pages 93-96 have this maximum
| wattage: https://docs.nvidia.com/nvidia-connectx-6-dx-ethernet-
| adapte...
|
| Or, you can skip all the hand calculations and just fiddle with
| Dell's website to put together an order for a rack while trying
| to mirror the specs as closely as possible (I only included two
| NICs, since it complained that the configuration didn't have
| enough low-profile PCIe slots for four):
|
| https://www.dell.com/en-us/shop/dell-poweredge-servers/power...
|
| In this case, I selected a 1100W power supply and it's still
| yelling at me that it's not enough; 1400W is enough to make
| that nag go away.
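A quick sanity check of the arithmetic above (all component counts and wattages are the estimates from this comment, not official Netflix figures):

```python
# Back-of-the-envelope power budget for the 800Gb/s server, using
# the per-component estimates cited above (TDP/max ratings, so an
# upper-bound-ish subtotal, not measured draw).
components = {
    "CPU (2x AMD EPYC 7713 @ 225W TDP)": 2 * 225,
    "SSD (16 drives @ ~20W under load)": 16 * 20,
    "NIC (4x ConnectX-6 Dx @ ~24W)": 4 * 24,
}
subtotal = sum(components.values())
for part, watts in components.items():
    print(f"{part}: {watts}W")
print(f"Subtotal before cooling/backplane/PSU losses: {subtotal}W")
```

That ~866W subtotal is already close to 1100W once PSU inefficiency and the rest of the chassis are added, which is consistent with Dell's nag about the 1100W supply.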
| throwbigdata wrote:
| That's not how TDP works.
| vitus wrote:
| You're not wrong, but it's still a nominally usable lower
| bound for the actual power draw of the chip, and a
| reasonable proxy for how much heat you need to dissipate
| via your cooling solution.
| ksec wrote:
| Well, I am assuming memory and SSD are the same. The only
| difference should be CPU + NIC, since Bluefield itself is the
| NIC. Maybe Drewg123 could expand on that (if he is allowed
| to).
| vitus wrote:
| That is a fair point, as the 2x CPU + 4x NIC are "only"
| about 550W put together. There's probably more overhead for
| cooling (as much as 40% of the datacenter's power --
| multiplying by 1.5x pushes you just over that 800W number).
|
| That said, being able to do 800G in an 800W footprint
| doesn't automatically mean that you can drive 100G in a
| 100W footprint. Not every ISP needs that 800G footprint, so
| being able to deploy smaller nodes can be an advantage.
|
| Also: I was assuming that 100W was the whole package (which
| is super impressive if so), since the Netflix serving model
| should have most of the SSDs in standby most of the time,
| and so you're allowed to cheat a little bit in terms of
| actual power draw vs max rating of the system.
| walteweiss wrote:
| Is it part of some talk? Is it available online?
| phoronixrly wrote:
| Here it is: https://www.youtube.com/watch?v=q4TZxj-Dq7s
| jsnell wrote:
| > Had we moved between -stable branches, bisecting 3+ years of
| changes could have taken weeks
|
| Would it really? Going by their number of 4 hours per bisect
| step, you get 6 bisections per day, which cuts the range to
| 1/64th of the original. The difference between "three years of
| changes" and "three weeks of changes" is a factor of 50x. I.e.
| within one day, they'd already have identified the problematic
| three week range. After that, the remaining bisection takes
| exactly as long as this bisection did.
|
| Even if they're limited to doing the bisections just during
| working hours in one timezone for some reason, you'd still get
| those six bisection steps done in just three days. It still
| would not add weeks.
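The arithmetic, spelled out (the 4 hours/step and the 3-year vs 3-week ranges are from the thread; the commits-per-week figure is a rough assumption):

```python
import math

STEP_HOURS = 4                    # ramp + re-image time per bisect step
steps_per_day = 24 // STEP_HOURS  # 6 steps/day if fully automated

# One day of bisection shrinks the candidate range by 2**6 = 64x,
# so ~156 weeks (3 years) of commits narrows to ~2.4 weeks.
weeks = 156
print(f"after one day: {weeks / 2**steps_per_day:.1f} weeks left")

# Total steps to isolate a single commit, assuming ~170 commits/week
# (about 500 commits per 3 weeks, per the thread):
total_steps = math.ceil(math.log2(weeks * 170))
print(f"~{total_steps} steps, ~{total_steps / steps_per_day:.1f} days total")
```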
| iainmerrick wrote:
| That's a very good point, but merging in three years of changes
| would also have pulled in a lot of minor performance changes,
| and possibly some incompatible changes that would require some
| code updates. That would slow down each bisection step, and
| also make it harder to pinpoint the problem.
|
| If you know that some small set of recent changes caused a 7%
| regression, you can be fairly confident there's a single cause.
| If 3 years of updates cause a 6% or 8% regression (say), it's
| not obvious that there's going to be a single cause. Even if
| you find a commit that looks bad, it might already have been
| addressed in a later commit.
|
| _Edit to clarify:_ you 're technically correct (the best kind
| of correct!) but I'd still much prefer to merge 3 weeks rather
| than 3 years, even though their justification isn't quite
| right.
| wccrawford wrote:
| I take your point, but that assumes someone working 24 hour
| days, or constantly handing off the project to at least 2 other
| people _every day_.
|
| I don't think those are reasonable work scenarios, so it's more
| like 2 bisects (maybe 3!) per day, rather than 6.
| jsnell wrote:
| Not really, it just assumes that the bisection process is
| automated.
|
| But also, I addressed exactly this objection in the second
| paragraph :)
| jonhohle wrote:
| If a bisection takes a day, it would probably take longer
| to automate than just find it manually. For performance
| bugs, you may need to look at non-standard metrics or
| profiling that would otherwise be a one-off and don't
| necessarily make sense to automate.
| jsnell wrote:
| The full bisection taking just a day doesn't seem
| compatible with the parameters of the story.
|
| Three weeks of FreeBSD changes seems to be about 500
| commits. That's about 9 bisection steps. At two steps /
| day (the best you can do without automation), that's a
| full work week. It seems obvious that this is worth
| automating.
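A toy demonstration of what that automation looks like with `git bisect run`: a throwaway 16-commit repo stands in for the real tree, and a pass/fail command stands in for the performance test (Netflix's actual step also involves re-imaging and ramping a server).

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    """Run a git command in `cwd` and return its stdout."""
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "bisect@example.com", cwd=repo)
git("config", "user.name", "bisect", cwd=repo)

# 16 commits; commit 11 silently flips perf.txt from "fast" to
# "slow", standing in for the regression-introducing change.
for i in range(1, 17):
    with open(os.path.join(repo, "perf.txt"), "w") as f:
        f.write("slow" if i >= 11 else "fast")
    git("add", "perf.txt", cwd=repo)
    git("commit", "-qm", f"commit {i}", cwd=repo)

# bad = HEAD, good = root commit; the run command must exit 0 on a
# good revision and non-zero (except 125) on a bad one.
root = git("rev-list", "--max-parents=0", "HEAD", cwd=repo).strip()
git("bisect", "start", "HEAD", root, cwd=repo)
git("bisect", "run", "grep", "-q", "fast", "perf.txt", cwd=repo)
first_bad = git("show", "-s", "--format=%s",
                "refs/bisect/bad", cwd=repo).strip()
git("bisect", "reset", cwd=repo)
print("first bad:", first_bad)
```

git drives the binary search itself; the operator's only job is supplying a reliable good/bad test, which is exactly the part the 4-hour ramp/re-image cycle dominates.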
| yencabulator wrote:
| Bisection and testing might be automated, but resolving
| merge conflicts isn't.
| drewg123 wrote:
| Author here: Note that the 4 hours per bisection is the time to
| ramp the server up and down and re-image it. It does not count
| the time to actually do the bisection step. That's because in
| our case of following head, the merging & conflicts were
| trivial for each step of the 3 week bisection. Eg, the
| bisections we're doing are far simpler than the bisections we'd
| be doing if we didn't track the main branch, and had a far
| larger patchset that had to be re-merged at each bisection
| point. I'm cringing just imagining the conflicts.
| crote wrote:
| Does "the server" imply you're only using a single server for
| this? I would have expected that at Netflix's scale it
| wouldn't be _too_ difficult to do the whole bisect-and-test
| on a couple dozen servers in parallel.
| generalizations wrote:
| Wouldn't each bisect depend on the ones before it? So you
| can't ramp up the next test before you finish the prior.
| toast0 wrote:
| You could speculate in both directions, and test three
| revisions in parallel... However, if there's a lot of
| manual merging to do, you might not want to do the extra
| merge that's involved there. You might also get nerd
| sniped into figuring out if there's a better pattern if
| you're going to test multiple revisions in parallel.
| zellyn wrote:
| lol, came here to say this, armed with identical log2 53 == 5.7
| :-) The replies to your comment are of course, spot on, though.
| Finding an 8% performance regression in three years of commits
| could have taken a looooong time.
| hnarayanan wrote:
| Going through these slides was like reading a thriller!
| p0seidon wrote:
| I agree, but: the best thing about this is that the work is
| actually done by a small number of people instead of an
| overengineered system with a custom solution. The thriller is
| great, but the efficiency powering all of Netflix's CDN is
| also refreshing.
| gigatexal wrote:
| Have they ever talked about why the content isn't stored on/read
| from ZFS, just the root pool?
| iv42 wrote:
| Yes. Because sendfile(2) is not zero-copy on ZFS.
| shrubble wrote:
| Yes, it is because they can do zero-copy sending of content
| that they can't do under ZFS (yet). Some links to Netflix
| papers and video talks on this older Reddit thread:
| https://www.reddit.com/r/freebsd/comments/ltjv8m/zfs_is_over...
| gigatexal wrote:
| bravo thank you for that
| 8fingerlouie wrote:
| Besides not being able to do zero-copy on ZFS (yet), it
| probably also has to do with them not using RAID for content,
| and single-drive ZFS doesn't make much sense in that scenario.
|
| Single drive ZFS brings snapshots and COW, as well as bitrot
| detection, but for Netflix OCAs, snapshots are not used, and
| it's mostly read-only content, so not much use for COW, and
| bitrot is less of a problem with media. Yes, you may get a
| corrupt pixel every now and then (assuming it's not caught by
| the codec), but the media will be reseeded again every n days,
| so the problem will solve itself.
|
| I assume they have ample redundancy in the rest of the CDN, so
| if a drive fails, they can simply redirect traffic to the next
| nearest mirror, and when a drive is eventually replaced, it
| will be seeded again by normal content seeding.
| _zoltan_ wrote:
| yet? you write "yet" as if it's something that would be almost
| readily available, yet it's been at least 2 years now and
| it's still not there.
|
| am I missing something, or is that "yet" more like "maybe
| sometime, if ever"?
| toast0 wrote:
| Netflix has been developing in kernel and userland for
| this; if ZFS for content was a priority, they could make
| 0-copy sendfile work. Yes, it's not trivial at all. It will
| probably happen eventually, by someone who needs 0-copy and
| ZFS; or by ZFS people who want to support more use cases.
| sophacles wrote:
| Yet does not suggest any sort of impending completion.
|
| We haven't been to the center of the galaxy yet.
|
| We haven't achieved immortality yet.
|
| Both are valid sentences without any fixed timeline, and in
| the case of the first, a date that is hundreds of thousands
| of years in the future at soonest.
|
| "yet" just means "up until this time" (in context, sometimes
| it means "by the time you're talking about" - e.g. I'm
| scheduled on-call on the 13th, but i won't be back from my
| PTO yet).
| drewg123 wrote:
| In addition to the lack of zero-copy sendfile from ZFS, we also
| have the problem that ZFS is lacking async sendfile support
| (e.g., the ability to continue the sendfile operation from the
| disk interrupt handler when the blocks arrive from disk).
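For context, sendfile(2) is the syscall under discussion: the kernel pushes file pages straight to the socket with no userspace copy. A minimal, illustrative serving loop via Python's `os.sendfile` wrapper (Netflix's real path adds kTLS offload, the async disk-interrupt continuation described above, and much more):

```python
import os
import socket

def serve_file(conn: socket.socket, path: str) -> int:
    """Send the whole file over the socket without copying the
    data through userspace; the kernel moves the pages directly.
    Returns the number of bytes sent."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(conn.fileno(), f.fileno(), offset,
                               size - offset)
            if sent == 0:  # peer closed or nothing left to send
                break
            offset += sent
    return offset
```

On FreeBSD with UFS this path can be fully zero-copy; per the thread, on ZFS the data is copied and async continuation isn't available, which is why the content drives don't use ZFS.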
| phoronixrly wrote:
| Video from the talk at OpenFest 2023:
| https://www.youtube.com/watch?v=q4TZxj-Dq7s
| eatonphil wrote:
| Competition is good (not everyone using Linux, I mean), and I've
| run FreeBSD on my desktop and server for a few years.
|
| But whenever Netflix's use of FreeBSD comes up, I never come away
| with a concrete understanding of: is the cost/performance
| Netflix gets with FreeBSD really not doable on Linux?
|
| I'm trying to understand if it's inertia, or if not, why more
| cloud companies with similar traffic (all the CDN companies for
| example) aren't also using FreeBSD somewhere in their stack.
|
| If it were just the case that they like FreeBSD that would be
| fine and I wouldn't ask more. But they mention in the slides
| FreeBSD is a performance-focused OS which seems to beg the
| question.
| Thaxll wrote:
| The truth is that some senior engineer at Netflix chose to use
| FreeBSD and they've stuck with that choice since then. FreeBSD is
| not better; it just happens to be the solution they chose.
|
| All the benefits they added to FreeBSD could be added the same
| way in Linux if they were missing.
|
| YouTube / Google CDN is much bigger than Netflix and runs 100%
| on Linux; you can make pretty much everything work on modern
| solutions / languages / frameworks.
| inopinatus wrote:
| Sorry, this is seriously misinformed:
|
| > YouTube / Google CDN is much bigger than Netflix
|
| Youtube and Netflix are on par. According to Sandvine,
| Netflix sneaked past Youtube in volume in 2023[1]. I believe
| their 2024 report shows them still neck-and-neck.
|
| > you can make pretty much everything work on modern solution
|
| Presenting a false equivalence without evidence is not
| convincing. "You could write it in SNOBOL and deploy it on
| TempleOS". Netflix didn't choose something arbitrary or by
| mistake. They chose one of the world's highest performing and
| most robustly tested infrastructure operating systems. It's
| the reason a FreeBSD derivative lies at the core of Juniper
| routers, Isilon and Netapp storage heads, every Playstation
| 3/4/5, and got mashed into NeXTSTEP to spawn Darwin and
| thence macOS, iOS etc.
|
| It continues to surprise me how folks in the tech sector
| routinely fail to notice how much BSD is deployed in the
| infrastructure around them.
|
| > All the benefits they added to FreeBSD could be added the
| same way in Linux
|
| They cannot. Case in point, the bisect analysis described in
| the presentation above doesn't exist for Linux, where
| userland distributions develop independently from the kernel.
| Netflix is touting here the value of FreeBSD's unified
| release, since the bisect process fundamentally relies on a
| single dimension of change (please ignore the mathematicians
| muttering about Schur's transform).
|
| [1] https://www.sandvine.com/press-
| releases/sandvines-2023-globa...
| davidw wrote:
| > There's a reason it's the core of Juniper routers, Isilon
| and Netapp storage heads, every Playstation 3/4/5, and got
| mashed into NeXTSTEP to spawn macOS.
|
| Licensing?
|
| FreeBSD is a fine OS that surely has some advantages here
| and there, but I'm also inclined to think that big
| companies can make stuff work if they want to.
|
| PHP at Meta seems like a pretty good example of this.
| inopinatus wrote:
| The permissive, freewheeling nature of the BSD license is
| touted by some as an advantage for infrastructure use but
| in practice, Linux-based devices and services aren't
| particularly encumbered by the GPL, so to me it's a wash.
| davidw wrote:
| Could be lawyers at some companies just didn't want
| anything to do with the GPL, especially if they're
| fiddling with the kernel itself. Maybe they're not even
| correct about the details of it, just fearful. "Lawyers
| overly cautious" is not an uncommon thing to see.
| bluGill wrote:
| Having seen the screams from some people when you use the GPL
| legally, I can't blame the lawyers. You might be correct, but
| you will still get annoying people yelling that you are not.
| There is also the cost of verifying you are correct. (We put
| the wrong address in for where to send your request for
| source; fortunately a tester caught it before release, but it
| was still expensive, as the lawyers forced the full recall
| process, meaning we had to send a tech to upgrade the testers
| instead of saying where to download the fix.)
| throw0101d wrote:
| > [...] _Linux-based devices and services aren't particularly
| encumbered by the GPL, so to me it's a wash._
|
| Linux uses GPL 2.x. If we are talking about GPL 3.x
| things may be different:
|
| > _One major danger that GPLv3 will block is tivoization.
| Tivoization means certain "appliances" (which have computers
| inside) contain GPL-covered software that you can't
| effectively change, because the appliance shuts down if it
| detects modified software._
|
| * https://www.gnu.org/licenses/rms-why-gplv3.en.html
|
| If you want to ship (embedded) GPL 2 Linux-based devices
| things are fine, but if there's software that is GPL 3
| then that can cause issues. Linus on GPLv3:
|
| * https://www.youtube.com/watch?v=PaKIZ7gJlRU
| cheema33 wrote:
| > Licensing?
|
| I think you've hit the nail on the head. I used to work
| for a very very large company. And they had a very strong
| preference for BSD licensed software, if we were building
| something using outside software. A few years ago, Steve
| Ballmer and others spent a lot of time spreading FUD by
| calling Linux and the GPL a "cancer". Believe it or not, that
| stuff had a massive impact. Not on small shops, but on large
| company lawyers.
|
| Over the years, Steve left Microsoft and Microsoft has
| become a lot more Linux friendly. And the cancer fear has
| subsided. But it was very very real.
|
| On a side note, if I recall correctly, Steve Jobs wanted
| to use Linux for MacOS. But he had licensing concerns and
| called up Linus and asked him to change a few things.
| Linus gave him the middle finger and that is how we got
| MacOS with BSD.
| SoftTalker wrote:
| I doubt this story about Linus and Steve. MacOS is
| NEXTSTEP-derived and that pre-dated Linux by several
| years.
| kelsey98765431 wrote:
| Fun fact! NeXT was a commercial AT&T Unix fork. The
| transition from the Unix base to the BSD base did in fact
| happen at Apple after the acquisition. The value of NeXT
| was in its application library, which would eventually
| become the Mac foundation libraries like CoreAudio and
| Cocoa etc. The earliest releases of Rhapsody are very
| illuminating about the architecture of XNU/OSX. I don't
| doubt that Linux was considered. There's a specific time
| when the actual move of Rhapsody to a FreeBSD base
| occurred, and it was at Apple sometime in 97 or 98 IIRC.
| teddyh wrote:
| NeXTSTEP was always BSD4.3-Tahoe Unix, not AT&T Unix.
| inopinatus wrote:
| Be all that as it may, this is still a "why not Linux"
| line of thinking, rather than "why FreeBSD", which is the
| more interesting question. And it is not a binary choice.
| toast0 wrote:
| > Incorrect. Case in point, the bisect analysis described
| in the presentation above doesn't exist for Linux, where
| userland distributions develop independently from the
| kernel.
|
| You can certainly bisect the Linux kernel changes though.
| And the bug in question was a kernel bug. For a project
| like this, IMHO, most of the interesting bugs are going to
| be kernel bugs.
| inopinatus wrote:
| Perhaps so, but this knowledge is _post hoc_ for the
| incident and does not undermine the engineering value of
| the unified release.
| toast0 wrote:
| Probably, if you were doing this on Linux, you'd follow
| someone's Linux tree, and not run Debian unstable with
| updates to everything.
|
| You might end up building a lightweight distribution
| yourself; some simple init, enough bits and bobs for in
| place debugging, and the application and its monitoring
| tools.
|
| Anyway, if you did come across a problem where userland
| and kernel changed and it wasn't clear where the breakage
| is, the process is simple. Test new kernel with old
| userland and old kernel with new userland, and see which
| part broke. Then bisect if you still can't tell by
| inspection.
| forgotpwd16 wrote:
| >It's the reason a FreeBSD derivative
|
| _A_ reason. Licensing also plays a role. Some may say the
| most important one.
|
| >Case in point
|
| Not really. This is an advantage, yes, but it was inherent to
| the BSD development style, not an addition they made. I assume
| GP refers to other presentations, which talk about the
| optimizations needed to get the performance Netflix needs
| from FreeBSD.
|
| >doesn't exist for Linux
|
| But it can be done by putting the kernel and userland in the
| same repo as modules.
| secondcoming wrote:
| 'runs 100% on Linux' is a bit vague. What customisations do
| they do?
| yencabulator wrote:
| Google has ~9000 kernel patches, and a fairly custom
| userspace. Saving 2% at that scale is _huge_.
| hiAndrewQuinn wrote:
| Might be a manpower thing. By hiring a bunch of FreeBSD Core
| devs, Netflix might be able to get a really talented team for
| cheaper than they might get a less ideologically flavored team.
| (I say this as I set up i3 on FreeBSD on my Thinkpad X280, I'm
| a big fan!)
| IshKebab wrote:
| They also get much more control. If they employ most of the
| core FreeBSD devs they basically have their own OS that they
| can do what they like with. If they want some change that
| benefits their use case at the detriment of other people they
| can pretty much just do it.
|
| That's not really possible with Linux.
| guenthert wrote:
| The flip side is of course that in this scenario they would
| have to finance most improvements. In Linux land the cost
| of development is shared across many, which one might
| expect to yield a _generally_ better product.
| toast0 wrote:
| They're at the cutting edge. They're going to have to
| finance the development of most of this stuff anyway.
| yencabulator wrote:
| _This_ development, but not all of the other things that
| go into a kernel and distro. If they pocket the whole
| core team, then they end up paying for all of the work,
| not just the few optimizations they really care about.
| throw0101d wrote:
| > _In Linux land the cost of development is shared across
| many, which one might expect to yield a_ generally
| _better product._
|
| And you have more cooks in the kitchen tweaking things that
| you also want to work on, so there's a higher chance of
| conflicts, plain and simple, but also of folks who want to go
| in a completely different direction
| technically/philosophically.
| kelsey98765431 wrote:
| When netflix was founded the only viable commercial linux
| vendor was rhel and the support contract would have been
| about the same as just hiring the fbsd core team at
| salary.
|
| People really do not remember the state of early linux.
| Raise a hand if you have ever had to compile a kernel
| module from a flash drive to make a laptops wifi work,
| then imagine how bad it was 20 years before you tried and
| possibly failed at getting networking on linux to work,
| let alone behave in a consistent manner.
|
| The development costs were not yet shared back then, most
| of the linux users at the vendor support level were still
| smaller unproven businesses and most importantly if you
| intend to build something that nobody else has you do not
| exactly want to be super tied to a large community that
| can just take your work and directly go clone your
| product.
|
| Hiring foss devs and putting them under NDA for
| everything they write that doesn't get upstreamed is an
| excellent way to get nearly everything upstreamed aswell,
| and the cost of competitors then porting these merged
| upstream changes back down into their linux is not
| nothing, so this gives a competitive moat advantage.
| toast0 wrote:
| > When netflix was founded the only viable commercial
| linux vendor was rhel and the support contract would have
| been about the same as just hiring the fbsd core team at
| salary.
|
| AFAIK, currently, Netflix only uses FreeBSD for their CDN
| appliances; their user facing frontend and apis live (or
| lived) in AWS on Linux. I don't know what they were
| running on before they moved to the cloud.
|
| I don't think they started doing streaming video until
| 2007 and they didn't start deploying CDN nodes until 2011
| [1]. They started off with commercial CDNs for video
| distribution. I don't know what the linux vendor
| marketplace looked like in 2007-2011, but I'm sure it
| wasn't as niche as in 1997 when Netflix was founded. I
| think they may have been using Linux for production at
| the time that they decided to use FreeBSD for CDN
| appliances.
|
| > Hiring foss devs and putting them under NDA for
| everything they write that doesn't get upstreamed is an
| excellent way to get nearly everything upstreamed aswell,
| and the cost of competitors then porting these merged
| upstream changes back down into their linux is not
| nothing, so this gives a competitive moat advantage.
|
| I don't think Netflix is particularly interested in a
| software moat; or they wouldn't be public about what they
| do and how, and they wouldn't spend so much time
| upstreaming their code into FreeBSD. There's an argument
| to be made that upstreaming reduces their total effort,
| but that's less true the less often you merge. Apple
| almost never merges in upstream changes from FreeBSD back
| into mac os; so not bothering to upstream their changes
| saves them a lot of collaborative work at the cost of
| making an every 10 years process a little bit longer.
|
| At WhatsApp, I don't think we ever had more than 10
| active patches to FreeBSD, and they were almost all tiny;
| it wasn't a lot of effort to port those forward when we
| needed to, and certainly less effort than sending and
| following up on getting changes upstream. We did get a
| few things upstreamed though (nothing terribly
| significant IMHO; I can remember some networking things
| like fixing a syncookie bug that got accidentally
| introduced and tweaking the response for icmp needs frag
| to not respond when the requested mtu was at or bigger
| than the current value; nothing groundbreaking like async
| sendfile or kTLS).
|
| [1] https://web.archive.org/web/20121021050251/https://si
| gnup.ne...
| yencabulator wrote:
| Netflix, the online service, launched in 2007. The
| previous business, doing DVD mailing, had no such high
| bandwidth serving requirements. You're exaggerating the
| timeline quite a lot.
| toast0 wrote:
| It's hard to do an apples-to-apples comparison, because you'd
| need two excellent, committed teams working on this.
|
| I'm a FreeBSD supporter, but I'm sure you _could_ get things to
| work in Linux too. I haven't seen any posts like 'yeah, we got
| 800 gbps out of our Linux CDN boxes too', but I don't see a lot
| of posts about CDN boxes at all.
|
| As someone else downthread wrote, using FreeBSD gives a lot of
| control, and IMHO provides a much more stable base to work
| with.
|
| Where I've worked, we didn't follow -CURRENT, and tended to
| avoid .0 releases, but it was rare to see breakage across
| upgrades, and it was typically easy to track down what changed
| because there usually weren't a lot of changes in the suspect
| system. That's not really been my experience with Linux.
|
| A small community probably helps get their changes upstreamed
| regularly too.
| bluedino wrote:
| Are there papers out there from other companies that detail
| what performance levels have been achieved using Linux in a
| similar device to the Netflix OCA? Maybe they just use two
| devices that have 90% of the performance?
| bbatha wrote:
| My understanding was that Netflix liked FreeBSD for a few
| reasons, some of them more historical than others.
|
| * The networking stack was faster at the time
|
| * dtrace
|
| * async sendfile(2) https://lists.freebsd.org/pipermail/svn-
| src-head/2016-Januar...
|
| Could they have contributed async sendfile(2) to Linux as well?
| Probably. In 2024 these advantages seem to be moot: ebpf,
| io_uring, more maturity in the Linux network stack, plus FreeBSD
| losing more and more vendor support by the day.
| TheCondor wrote:
| The community is such that if one of either FreeBSD or Linux
| really outperformed the other in some capacity, they'd rally
| and get the gap closed pretty quickly. There are benchmarks
| that flatter both but they tend to converge and be pretty
| close.
|
| Netflix has a team of, I think it was 10 from the PDF,
| engineers working on this platform. A substantial number of
| them are FreeBSD contributors with relationships to it. It's a
| very special team. That's the difference maker here. If it was
| a team of former Red Hat guys, I'm sure they'd be on Linux. If
| it was a team of old Solaris guys, I wouldn't be surprised if
| they were on one of the Solaris offsprings. Then, Netflix knew
| that at their scale and to make it work, they had to build this
| stuff out. That was something they figured out pretty quickly,
| they found the right team and then built around them. It's far
| more sophisticated than "we replaced RedHat Enterprise with
| FreeBSD and reaped immediate performance advantages." At their
| scale, I don't think there is an off the shelf solution.
| kelsey98765431 wrote:
| I think you're close but missing an element. FreeBSD is a
| centrally designed and implemented OS from kernel to libc to
| userland. The entire OS is in the source tree, with updated
| and correct documentation, with coherent design.
|
| Any systems-level team will prefer this kind of tightly
| integrated solution to the systems-layer problem if they are
| responsible for a highly built-out, specialized distributed
| application like Netflix. The reasons for the design choices
| going all the way back to FreeBSD 1 are available on the
| mailing list, in a central place. Everything is there.
|
| Trying to maintain your own linux distro is insanely
| difficult, infact google still uses a 2.2 kernel with all
| custom back ported updates of the last 30 years.
|
| The resources needed to match the relatively small FreeBSD
| design and implementation team are minuscule compared to the
| infinite sprawl of Linuxes; a team of 10 FreeBSD sysprogs is
| basically the same number of people responsible for designing
| and building the entire thing.
|
| It comes down to the sources of truth. In the world of fbsd
| that's the single fbsd repo and the single mailing list. For
| a linux, that could be thousands of sources of truth across
| hundreds of platforms.
| TheCondor wrote:
| I understand the point you're trying to make and I agree
| that FreeBSD tends to have cleaner code and better
| documentation at different levels but I don't think that it
| makes it that much more difficult. If you dropped in from a
| different world and had zero experience then I think you're
| right and a systems team would almost always pick BSD. Any
| actual experience pretty quickly swings it the other way
| though; there are also companies dedicated to helping you
| fill in those gaps on the Linux side.
|
| I've built a couple embedded projects on Linux, when you're
| deep on a hard problem, the mailing lists and various
| people are nice, but the "source of truth" is your logic
| analyzer and you debug the damn thing. Or your hardware
| vendor might have some clues as they know more of the bugs
| in their hardware.
| kelsey98765431 wrote:
| Fair points taken; I was a bit zealous in my use of the
| word "any", the word "many" is more correctly applicable.
|
| In regard to sources of truth, I mean from the design-
| considerations point of view. For instance, why does the
| scheduler behave a certain way? We can analyze the logic
| and determine the truth of the code's behavior, but
| determining the reason for the selection of that design
| for implementation is far more difficult.
|
| These days, yes, off-the-shelf Linux will do just fine at
| massively scaling an application. When Netflix started
| building, Blockbuster was still a huge scary thing to be
| feared and respected. Linux was still a hobby project if
| you didn't fork out 70% of a commercial Unix contract
| over to RHEL.
|
| The team came in with the expectation and understanding
| they would be making massive changes for their
| application that may never be merged upstream. The
| chances of getting a merge in are higher if the upstream
| is a smaller centralized team. You can also just ask the
| person who was in charge of the, let's say for example,
| the init system design and implementation. Or oh, that
| scheduler! Why does it deprioritize x and y over z, is
| there more to why this was chosen than what is on the
| mailing list?
|
| The pros go on and on, and the cons are difficult to
| imagine unless you make a vacuum and compare 2024 Linux
| to 2004 Linux.
| kstrauser wrote:
| > infact google still uses a 2.2 kernel with all custom
| back ported updates of the last 30 years.
|
| Say what?
| mbilker wrote:
| They are trying to get off or have gotten off their
| kernel fork called "Prodkernel" for some time now.
|
| https://lwn.net/Articles/871195/
| https://events.linuxfoundation.org/wp-
| content/uploads/2022/0...
| yencabulator wrote:
| And ProdKernel generally lags mainstream by only a few
| years, as can be seen in the things you linked to.
|
| The change being talked about here is moving from merging
| every ~2 years to merging all the time. Saying they're
| stuck on something from 1999 is ridiculous.
| yencabulator wrote:
| > infact google still uses a 2.2 kernel with all custom
| back ported updates of the last 30 years.
|
| Linux 2.2 is from 1999. It can barely do SMP. That's pretty
| much a crazy person claim.
|
| Googlers say ProdKernel is merged forward every few years:
|
| > Every two years or so, those patches are rebased onto a
| newer kernel version, which provides a number of
| challenges.
|
| https://lwn.net/Articles/871195/
|
| > Every ~2 years we rebase all these patches over a ~2 year
| codebase delta
|
| https://events.linuxfoundation.org/wp-
| content/uploads/2022/0... (and many other places)
| arp242 wrote:
| Practically this just doesn't matter all that much. You can
| prefer one approach to the other and that's all fine, but
| from a "serious engineering" perspective it just doesn't
| really matter.
|
| > Trying to maintain your own linux distro is insanely
| difficult, infact google still uses a 2.2 kernel with all
| custom back ported updates of the last 30 years.
|
| 2.2? Colour me skeptical on that.
|
| But it's really not that hard to make a Linux distro. I
| know, because I did it, and tons of other people have. It's
| little more than "get kernel + fuck about with Linux
| booting madness + bunch of userland stuff". The same
| applies to FreeBSD by the way, and according to the PDF
| "Netflix has an internal "distro"".
|
| The problems Google has is because they maintain extensive
| patchsets, not because they built their own distro. They
| would have the same problems with FreeBSD, or any other
| complex piece of software.
| Thaxll wrote:
| I'm still surprised they did not get away from Nginx at that
| point and did something like Cloudflare.
| alberth wrote:
| When you're pushing the amount of data Netflix is, you need to
| work directly with the ISP & Exchanges.
|
| At least that's my guess.
| yencabulator wrote:
| Parent is talking purely about software.
|
| https://blog.cloudflare.com/how-we-built-pingora-the-
| proxy-t...
| toast0 wrote:
| Do you mean moving away from using Nginx, like Cloudflare moved
| to a custom replacement? [1]
|
| I don't think that's as needed for Netflix. My understanding is
| their CDN nodes are using Nginx to serve static files --- their
| content management system directs off-peak transfers and
| routes clients to nodes that it knows have the files. They don't
| run a traditional caching (reverse) proxy, and they most likely
| don't hit a lot of the things Cloudflare was hitting, because
| their use case is so different.
|
| (I haven't seen a discussion of how Netflix handles live
| events, that's clearly a different process than their
| traditional prerecorded media)
|
| [1] https://blog.cloudflare.com/how-we-built-pingora-the-
| proxy-t...
| drewg123 wrote:
| Yes, exactly.
| Thaxll wrote:
| Yes, I'm talking about Nginx running on those BSD boxes;
| they have such a custom design that writing their own
| static-file server would have made sense.
| toast0 wrote:
| Getting an HTTP server to work with the diversity of HTTP
| clients in the world that Netflix supports is not going to
| be fun, and NGINX is right there.
|
| As I understand it, they've made some changes to NGINX, but
| I don't think they've made a lot, and I don't think any were
| cases where the structure of NGINX was not conducive to the
| change or was limiting.
|
| I'm not one to shy away from building a custom design, but
| it's a lot easier when you control the clients, and Netflix
| has to work with everything from browsers to weird SDKs in
| TVs.
|
| Netflix OCA performance seems mostly bottlenecked on
| I/O/memory bandwidth (and cpu overhead for pacing?) and any
| sensible HTTP server for static files isn't going to use a
| lot of memory bandwidth processing inbound requests and
| calling sendfile on static files. So why spend the limited
| time of a small team building something that's not going to
| make a big difference?
| ay wrote:
| Sometimes it's still hard to tackle the psychology of people
| who are used to the "comfort" of the "sta[b]le" branches.
|
| So at work I came up with the following observation: if you
| are a consumer and are afraid of unpredictable events at the
| head of the master/main branch, then by using the head of
| master/main from 21 days ago you get 3 weeks of completely
| predictable and _modifiable_ future.
|
| Any cherry-picks are made during the build process, so there
| is no branch divergence - if the fix gets upstreamed, it is
| not cherry-picked anymore.
|
| Thus, unlike with stable branches, by default it converges back
| to master.
|
| "But what if the fix is not upstreamed?" - then it stays
| there, and depending on the nature of the code it bears a
| bigger or smaller ongoing cost - which reflects the
| technical debt well, as it is.
|
| This has worked pretty well for the past 4-5 years and is now
| used for quite a few projects.
| bongodongobob wrote:
| This is how OS updates have worked at every company I've been
| at. Either you have a handful of devices that get them
| immediately and you scream-test, or you simply wait 3 weeks
| and then roll them out (minus security CVEs, of course).
| kazinator wrote:
| What are they bisecting with? FreeBSD uses CVS and Perforce.
|
| These items in the slide don't add up:
|
| > _Things had worked accidentally for years, due to linkerset
| alphabetical ordering_
|
| In other words, the real bug is many years old. Yet, on the last
| slide:
|
| > _Since we found the bug within a week or so of it hitting the
| tree, the developer responsible was incredibly responsive. All
| the details were fresh in his memory._
|
| What? Wasn't what hit the tree the ordering change which exposed
| the bug? It doesn't seem like that being fresh in anyone's mind
| is of much help.
| sakjur wrote:
| FreeBSD has moved to Git as its primary VCS, see
| https://lists.freebsd.org/pipermail/freebsd-current/2020-Dec...
| drewg123 wrote:
| FreeBSD hasn't used CVS or P4 in decades. FreeBSD uses git
| internally, and has a github mirror. See
| https://docs.freebsd.org/en/articles/committers-guide/
|
| The unintentional ordering change being fresh in Colin's
| memory was really helpful, as he quickly realized that he'd
| actually changed the ordering and sent me a patch to fix it.
| If it had been 3 years in the past, I suspect he would have
| been less responsive (I know I would have been).
| pronoiac wrote:
| There were two sets of bugs:
|
| * the new sort handled ties differently. They adjusted that,
| and they _could_ have stopped there.
|
| * the other was that loading the correct drivers was
| sensitive to ordering, which the old sort had masked. They
| handled this along with another driver bug, in amdtemp.
|
| If they'd found this years later, even investigating the first
| set would be a slower process - it wouldn't be fresh in minds,
| and it likely would then have other code relying on it, so
| adjusting it would be trickier.
| kazinator wrote:
| Why would you alphabetically order initializations?
|
| Every complex system I've ever worked on that had a large
| number of initializations was sensitive to ordering.
|
| Languages with module support like Wirth's Modula-2 ensure that
| if module A uses B, B's initialization will execute before A's.
| If there is no circular dependency, that order will never be
| _wrong_.
|
| The reverse order could work too, but it's a crapshoot then.
| Module dependency doesn't logically _entail_ initialization-
| order dependency: A's initializations might not require B's
| initializations to have completed.
|
| If you're initializing by explicit calls in a language that
| doesn't automate the dependencies, the baseline safest thing to
| do is to call things in a fixed order that is recorded somewhere
| in the code: array of function addresses, or just an init
| procedure that calls others directly.
|
| If you sort the init calls, it has to be on some property linked
| to dependency order, otherwise don't do it. Unless you've encoded
| something related to dependencies into the module names,
| lexicographic order is not right.
|
| In the firmware application I work on now, all modules have a
| statically allocated signature word that is initially zero and
| set to a pattern when the module is initialized. The external API
| functions all assert that the pattern has the correct value,
| which is strong evidence that the module had been initialized
| before use.
|
| On one occasion I debugged a static array overrun which
| trashed these signatures, causing the affected modules to
| assert.
| toast0 wrote:
| Having a consistent ordering avoids, by construction,
| differences in results from inconsistent ordering. IIUC,
| alpha sort was/is used as a tie-breaker after declared
| dependencies or other ordering information.
|
| In this case, two (or more) modules indicate they can handle
| the same hardware and didn't have information on priority if
| both were present. Probably this should be detected / raise a
| fault, but under the previous regime of alpha sort, it was
| handled nicely because the preferred drivers happened to sort
| first.
| andrewstuart wrote:
| >> Most of us are FreeBSD committers or contributors
|
| This is why you shouldn't copy this strategy.
| arp242 wrote:
| Most of us also aren't running Netflix, or anything of the
| sort. All the big companies that heavily use Linux at scale
| also employ tons of Linux kernel engineers.
| andrewstuart wrote:
| I love FreeBSD, and in a different world it really should
| occupy the place Linux does.
|
| But the reality is it's a Linux world and now Linux has systemd
| which makes switching to anything else not an option for me.
|
| You'd have to pry systemd from my cold, dead hands.
___________________________________________________________________
(page generated 2024-06-10 23:01 UTC)