[HN Gopher] No More Blue Fridays
       ___________________________________________________________________
        
       No More Blue Fridays
        
       Author : moreati
       Score  : 369 points
       Date   : 2024-07-22 12:21 UTC (10 hours ago)
        
 (HTM) web link (www.brendangregg.com)
 (TXT) w3m dump (www.brendangregg.com)
        
       | xg15 wrote:
       | > _In the future, computers will not crash due to bad software
       | updates, even those updates that involve kernel code. In the
       | future, these updates will push eBPF code._
       | 
       | Assuming every security critical system will be on a recent
       | enough kernel to support this...
        
         | efee22 wrote:
         | I think with an LTS distribution you should get very far these
         | days when it comes to implementing such sensors.
        
           | chasil wrote:
           | On rhel8 variants, you can use the Oracle UEK to get eBPF.
           | 
           | https://blogs.oracle.com/linux/post/oracle-linux-and-bpf
           | 
           | $ cat /etc/redhat-release /etc/oracle-release /proc/version
           | Red Hat Enterprise Linux release 8.10 (Ootpa)
           | Oracle Linux Server release 8.10
           | Linux version 5.15.0-203.146.5.1.el8uek.x86_64
           | (mockbuild@host-100-100-224-48) (gcc (GCC) 11.2.1 20220127
           | (Red Hat 11.2.1-9.2.0.1), GNU ld version 2.36.1-4.0.1.el8_6)
           | #2 SMP Thu Feb 8 17:14:39 PST 2024
        
         | dijit wrote:
         | And assuming there's no bugs in the BPF code...
         | 
         | Oh wait: https://news.ycombinator.com/item?id=41031699
        
           | efee22 wrote:
           | RHEL kernel.. right. Imho, I'd trust an upstream stable
           | kernel far more than a RHEL one for production which has
           | dozens of feature backports and an internal kABI to maintain..
           | granted RH has a QA team, but it is still impossible to test
           | everything beforehand.
        
             | worthless-trash wrote:
             | On the upside, non-root users can't insert eBPF code, so
             | it's a privileged operation, unlike on other distros.
        
               | nequo wrote:
               | Isn't it tied to CAP_BPF on every distro since the 5.8
               | kernel?
               | 
               | https://mdaverde.com/posts/cap-bpf/
        
         | dredmorbius wrote:
         | Considering the number of systems running very obsolete OSes
         | these days: WinNT (4x or 3x), Windows, DOS, or various
         | proprietary Unixen, stale Linux flavours, etc., etc., ... yes,
         | quite.
        
       | usrme wrote:
       | Does anyone know how far along the eBPF implementation for
       | Windows actually is? In the sense that it could start feasibly
       | replacing existing kernel drivers.
        
       | CoastalCoder wrote:
       | > If your company is paying for commercial software that includes
       | kernel drivers or kernel modules, you can make eBPF a
       | requirement.
       | 
       | Are they saying that device drivers should be written in eBPF?
       | 
       | Or maybe their drivers should expose an eBPF API?
       | 
       | I assume _some_ driver code still needs to reside in the actual
       | kernel.
        
         | prmoustache wrote:
         | These tools wouldn't need kernel drivers, only to target the
         | eBPF userspace API:
         | https://www.kernel.org/doc/html/latest/userspace-api/ebpf/in...
        
       | asynchronous wrote:
       | Is there a reason for the lack of naming+shaming Crowdstrike in
       | this blogpost? Was it to not give them any more publicity, good
       | or bad?
        
         | StevenWaterman wrote:
         | If you consider kernel programming to be inherently unsafe,
         | then you would consider this to be inevitable, meaning it's not
         | really the specific company's fault. They were just the unlucky
         | ones.
        
           | efee22 wrote:
           | Agree, Crowdstrike was an unlucky one, but it is more about
           | the issue in general. If I remember correctly, others like
           | sysdig also use their own kernel modules for collection.
        
           | asynchronous wrote:
           | I still hold true that testing even improperly would have
           | caught this before it hit worldwide. But I suppose you are
           | right, that doesn't help the argument being made here.
        
             | ForOldHack wrote:
             | Wasn't that the job of AI/co-pilot/clippy /D.E.P? "Would you
             | like me to try and execute a random blank file?"
             | 
             | And of course QA.
             | 
             | I was unaffected, but was fielding calls from customers.
             | 
             | My update Tuesday is the week after, so in-between MS and
             | my updates, I am very suspicious of everything.
             | 
             | I was also unaffected by 22H2, and spent time fielding
             | calls.
        
           | lordnacho wrote:
           | They could have helped their luck by doing some of the common
           | sense things suggested in the article.
           | 
           | For instance, why not find a subset of your customers that
           | are low risk, push it out to them, and see what happens? Or
           | perhaps have your own fleet of example installations to run
           | things on first. None of which depends on any specific
           | technology.
        
             | hello_moto wrote:
             | "find a subset of low risk customers" and use them as test
             | subjects?
             | 
             | Repeat that a few times to understand the repercussions.
             | 
             | If I were a customer and found out that I was being used as a
             | test subject, how would I feel?
        
               | whynotminot wrote:
               | Canary deployments are already an industry accepted
               | practice and it's shocking Crowdstrike apparently doesn't
               | do them.
        
               | hello_moto wrote:
               | Which industry? Cybersecurity or Cloud software?
        
               | whynotminot wrote:
               | Any industry that wants to reliably deliver software that
               | doesn't brick systems at scale? I'm confused by your
               | question.
               | 
               | Are you telling me the cybersecurity scene is special and
               | shouldn't follow best practices for software deployment?
        
               | hello_moto wrote:
               | Canary deployment for a subset of Salesforce customers
               | won't see much of a revolt from customers compared to an AV
               | definition rollout (not software, but AV definition) in
               | Cybersecurity, where gaps between 0day and rollout mean
               | you're exposed.
               | 
               | If customers found out that some are getting the rollout
               | faster than others, essentially splitting the group in
               | two, there will be a need for customer opt-in/opt-out.
               | 
               | If everyone is opting-out because of Friday, your Canary
               | deployment becomes meaningless.
               | 
               | Any proof that other Cybersecurity vendors do Canary
               | deployment for their AV definition? :)
               | 
               | PS: not to say that the company shouldn't test more
               | internally...
        
               | whynotminot wrote:
               | Canary deployment doesn't necessarily mean massive gaps
               | between deployment waves. You can fast-follow. Sure,
               | there may be scenarios with especially severe
               | vulnerabilities where time is of the essence. I'm out of
               | the loop if this crowdstrike update was such a scenario
               | where best practices for software deployment were worth
               | bypassing.
               | 
               | If this is just how they roll with regular definition
               | updates, then their deployment practices are garbage and
               | this kind of large scale disaster was inevitable.
        
               | hello_moto wrote:
               | Let's walk this through: Canary deployment to Windows
               | machines. If those Windows machines got hit with BSOD,
               | they will go offline. How do you determine if they go
               | offline because of Canary or because of regular
               | maintenance by the customer's IT cycle?
               | 
               | You can guess, but you cannot be 100% sure.
               | 
               | What if the targeted canary deployments are employees'
               | desktops that are OFFLINE during the time of rollout?
               | 
               | >I'm out of the loop if this crowdstrike update was such
               | a scenario where best practices for software deployment
               | were worth bypassing.
               | 
               | I did post a question: what about other Cybersecurity
               | vendors? Do you think they do canary deployment on their
               | AV definitions?
               | 
               | Here's more context to understand Cybersecurity:
               | https://radixweb.com/blog/what-is-mean-time-to-detect
               | 
               | Cybersecurity companies participate in annual Sec
               | evaluations that measure and grade their performance. That
               | grade is an input for Organizations to select vendors
               | outside their own metrics/measurements.
               | 
               | I don't know if MTTD is included in the contract/SLA. If
               | it is, you've got some answer as to why certain decisions
               | are made.
               | 
               | It's definitely interesting to see Software developers of
               | HN giving out their 2c for a niche Cybersecurity
               | industry.
        
               | whynotminot wrote:
               | > You can guess, but you cannot be 100% sure.
               | 
               | I worked in the cyber security space for a decent chunk
               | of my career, and the most frustrating part was cyber
               | security engineers thinking their problems were unique
               | and being completely unaware of the lessons software
               | engineering teams have already learned.
               | 
               | Yes, you need to tune your canary deployment groups to be
               | large and diverse enough to give a reliable indicator of
               | deployment failure, while still keeping them small enough
               | that they achieve their purpose of limiting blast radius.
               | 
               | Again, if you follow industry best practices for software
               | deployment, this is already something that should be
               | considered. This is a relatively solved problem -- this
               | is not new.
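               | 
               | To make the "large and diverse enough" point concrete, here
               | is a rough Python sketch that compares a canary group with
               | a control group that has not received the update. The
               | class, the function name and both thresholds are made up
               | for illustration, not anything a vendor is known to use.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    hosts: int      # machines in the group
    offline: int    # machines that missed their expected check-in

    @property
    def offline_rate(self) -> float:
        return self.offline / self.hosts if self.hosts else 0.0


def canary_looks_bad(canary: GroupStats, control: GroupStats,
                     min_hosts: int = 200, max_excess: float = 0.02) -> bool:
    """Return True when the canary group goes offline noticeably more often
    than a control group that did not receive the update."""
    if canary.hosts < min_hosts:
        # Too few machines: offline hosts could just be routine maintenance.
        raise ValueError("canary group too small to give a reliable signal")
    excess = canary.offline_rate - control.offline_rate
    return excess > max_excess
```

               | Routine maintenance shows up in both groups, so only the
               | excess in the canary group is treated as a signal.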
               | 
               | > I did post a question: what about other Cybersecurity
               | vendors? Do you think they do canary deployment on their
               | AV definitions?
               | 
               | I think that question is being asked right now by every
               | company using Crowdstrike -- what vendors are actually
               | doing proper release engineering and how fast can we
               | switch to them so that this never happens to us again?
        
               | hello_moto wrote:
               | >if you follow industry best practices for software
               | deployment, this is already something that should be
               | considered. This is a relatively solved problem -- this
               | is not new.
               | 
               | You have to ask the customer if they're okay with that
               | citing "our software might fail and brick your
               | machine".
               | 
               | I'd like to see any Sales and Marketing folks say that ;)
               | 
               | > I think that question is being asked right now by every
               | company using Crowdstrike -- what vendors are actually
               | doing proper release engineering and how fast can we
               | switch to them so that this never happens to us again?
               | 
               | Uber valid question and this BSOD incident might be a
               | turning point for customers to pay up more for their IT
               | infrastructure.
               | 
               | It's like: previously Cybersecurity vendors were shy to
               | ask customers to set up Canary systems because that's just
               | "one-more-thing-to-do". After BSOD: customers will
               | smarten up and do it without being asked, to the point
               | where they would ask Vendors to _support_ that type of
               | deployment (unless they continue to be cheap and lazy).
        
               | whynotminot wrote:
               | > You have to ask the customer if they're okay with that
               | citing "our software might fail and brick your
               | machine".
               | 
               | I think you're still missing the point of Canary
               | deployments. The question your sales team should ask is
               | "would you like a 5% chance of a bug harming your system,
               | or a 100% chance?"
               | 
               | > It's like: previously Cybersecurity vendors were shy to
               | ask customers to set up Canary systems because that's just
               | "one-more-thing-to-do"
               | 
               | You should be shy because it is not your customer's job
               | to set up canary deployments. Crowdstrike owns the
               | software and the deployment process. They should be
               | deploying to a subset of machines, measuring the results,
               | and deciding whether to roll forward or roll back. It is
               | not the customer's job to implement good release
               | engineering controls for Crowdstrike (although after this
               | debacle you may well see customers try).
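               | 
               | Deploying to a subset, measuring, then rolling forward or
               | back could look roughly like this. A sketch only: deploy,
               | healthy and rollback are hypothetical callbacks, and the
               | wave sizes and failure threshold are arbitrary choices.

```python
from typing import Callable, List, Sequence


def phased_rollout(hosts: Sequence[str],
                   deploy: Callable[[str], None],
                   healthy: Callable[[str], bool],
                   rollback: Callable[[str], None],
                   waves: Sequence[float] = (0.05, 0.25, 1.0),
                   max_failure_rate: float = 0.01) -> bool:
    """Deploy in growing waves; roll back if a wave turns out unhealthy.

    `waves` holds cumulative fractions of the fleet. Returns True when the
    update reached every host, False when it was rolled back.
    """
    updated: List[str] = []
    for fraction in waves:
        target = min(len(hosts), max(1, round(len(hosts) * fraction)))
        wave = hosts[len(updated):target]
        if not wave:
            continue
        for host in wave:
            deploy(host)
        updated.extend(wave)
        # Measure the wave that was just updated before going any further.
        failures = sum(1 for host in wave if not healthy(host))
        if failures / len(wave) > max_failure_rate:
            for host in updated:
                rollback(host)
            return False
    return True
```

               | With a bad update, only the first 5% wave is ever touched;
               | the remaining 95% never receive it.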
        
               | hello_moto wrote:
               | If you're referring to Canary deployment as the vendor's
               | internal deployment? I definitely agree.
               | 
               | What I find hard to accept is those in Software suggesting
               | rolling it out to a few customers first, because a Virus
               | Definition update isn't a cloud deployment doing an A/B
               | test.
               | 
               | Customers must know what's going on when it comes to
               | virus definitions and the implications for them, whether
               | they're part of the rollout group or not.
        
               | whynotminot wrote:
               | > If you're referring to Canary deployment as the vendor's
               | internal deployment? I definitely agree.
               | 
               | No, I'm talking about external deployment to customers.
               | They clearly also had a massive failure in their internal
               | processes too, since a bug this egregious should never
               | make it to the release stage. But that is not what I am
               | talking about right now.
               | 
               | > What I find hard to accept is those in Software
               | suggesting rolling it out to a few customers first, because
               | a Virus Definition update isn't a cloud deployment doing an
               | A/B test.
               | 
               | I don't care what you're releasing to customers--
               | application binary, configuration change, virus
               | definition, etc, if it has the chance of doing this much
               | damage it must be deployed in a controlled, phased way.
               | You cannot 100% one-shot deploy any change that has the
               | potential to boot-loop a massive number of systems like
               | this. This current process is unacceptable.
               | 
               | > Customers must know what's going on when it comes to
               | virus definitions and the implications for them, whether
               | they're part of the rollout group or not.
               | 
               | Who says they don't have to know? Telling your customers
               | that an update is planned and giving them a time window
               | for their update seems reasonable to me.
        
               | hello_moto wrote:
               | If it's virus defn, what's the process here?
               | 
               | * 0day is happening
               | 
               | * Cybersecurity vendors preparing virus definition
               | 
               | * Vendors send update => new virus definition is about to
               | go down in 1 hour, get ready.
               | 
               | Folks are asleep, nobody reads it?
               | 
               | Let's say now let's do Canary: let's deploy to a few
               | customers (this is unclear how this started: should this
               | be opt-in? opt-out?)
               | 
               | Some customers got it, others... who knows, unclear what
               | the processes are here.
               | 
               | Between here and there, 0day exploited customers because
               | AV defn is not there. What now?
               | 
               | I'm not sure how this plays out tbh.
        
               | lordnacho wrote:
               | > If I were the customers and I found out that I was used
               | as test subject, how would I feel?
               | 
               | In reality, every business has relationships that it
               | values more than others. If I wasn't paying a lot for it,
               | and if I was running something that wasn't critical (like
               | my side project) then why not? You can price according to
               | what level of service you want to provide.
        
               | hello_moto wrote:
               | Customers will ask to opt-out.
        
               | ahtihn wrote:
               | Customers will _pay_ to opt out.
        
             | gtsop wrote:
             | Why even do that? We have virtualization, they could
             | emulate real clients and networks of clients. This
             | particular bug would have been prevented for sure
        
               | lordnacho wrote:
               | Yeah I thought maybe the VM thing might not catch the bug
               | for some reason, but it seems like the natural thing to
               | do. Spin up VM, see if there's a crash. I heard the
               | technical reason had something to do with a file being
               | full of nulls, but that sort of thing you should catch.
               | 
               | Honestly, the most generous excuse I can think of is that
               | CS were informed of some sort of vulnerability that would
               | have profound consequences immediately, and that
               | necessitated a YOLO push. But even that doesn't seem too
               | likely.
        
           | brendangregg wrote:
           | Right, and we wanted to talk about all security solutions and
           | not make this about one company. We also wanted to avoid
           | shaming since they have been seriously working on eBPF
           | adoption, so in that regard they are at the forefront of
           | doing the right thing.
        
         | hiddencost wrote:
         | I think the article isn't about crowd strike. It's about ebpf.
        
           | pimlottc wrote:
           | The second paragraph is 100% about Crowdstrike. It even links
           | to the Wikipedia article:
           | 
           | https://en.m.wikipedia.org/wiki/2024_CrowdStrike_incident
        
             | hiddencost wrote:
             | CrowdStrike is mentioned, but the goal of the article is to
             | promote eBPF. CrowdStrike is tangentially related because
             | it draws attention to a platform that Gregg has put a lot
             | into.
        
       | kayo_20211030 wrote:
       | This isn't right. If I need a system to run _with_ a piece of
       | code, then it shouldn't run at all if that piece of code is
       | broken. Ignoring the failure is perverse. Let's say that the
       | driver code ensures that some medical machine has safety locks
       | (safeguards) in place to make sure that piece of equipment won't
       | fry you to a crisp; I'd prefer that the whole thing not run at
       | all rather than blithely operate with the safeguards disabled.
       | It's turtles all the way down.
        
         | Smaug123 wrote:
         | I think the premise is false? It's up to the eBPF implementor
         | what to do in the case of invalid input; the kernel could
         | choose to perform a controlled shutdown in that case. (I have
         | no idea what e.g. Linux actually does here, but one could
         | imagine worlds where the action it takes on invalid input is
         | configurable.)
         | 
         | Also your statement is _sometimes_ not true, although I
         | certainly sympathise in the mainline case. In some contexts you
         | really do need to keep on trucking. The first example to spring
         | to mind is "the guidance computers on an automated Mars
         | lander"; the round-trip to Earth is simply too long to defer
         | responsibility in that case. If you shut down then you _will_
         | crash, but if you do your best from a corrupted state then you
         | merely _probably_ crash, which is presumably better.
        
           | umanwizard wrote:
           | > I have no idea what e.g. Linux actually does here
           | 
           | If you attempt to load an eBPF program that the verifier
           | rejects, the syscall to load it fails with EINVAL or E2BIG.
           | What your user-space program then does is up to you, of
           | course.
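           | 
           | As a sketch of that user-space decision, in Python: the
           | load_program callable below is hypothetical and stands in for
           | whatever wraps the bpf() syscall, and the two policies are
           | just illustrative names. EINVAL and E2BIG are the errors named
           | above; bpf(2) also lists EACCES for programs deemed unsafe.

```python
import errno
from enum import Enum
from typing import Callable


class OnReject(Enum):
    FAIL_CLOSED = "fail_closed"   # refuse to continue without the sensor
    FAIL_OPEN = "fail_open"       # keep running, but report degraded state


# Errors treated here as "the verifier rejected the program".
VERIFIER_ERRNOS = {errno.EINVAL, errno.E2BIG, errno.EACCES}


def load_sensor(load_program: Callable[[], int], policy: OnReject) -> str:
    """Call a (hypothetical) loader and apply a policy if the kernel says no.

    `load_program` should return a program fd or raise OSError carrying the
    kernel's errno.
    """
    try:
        load_program()
        return "loaded"
    except OSError as exc:
        if exc.errno not in VERIFIER_ERRNOS:
            raise   # e.g. EPERM: a privilege problem, not a rejected program
        if policy is OnReject.FAIL_CLOSED:
            raise SystemExit(f"sensor rejected by verifier: {exc}")
        return "degraded"
```

           | Either way the kernel keeps running; the policy only decides
           | what the agent does about the missing program.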
        
         | phartenfeller wrote:
         | The medical machine software should just refuse to run with an
         | error message if a critical driver was not loaded. Bricking the
         | OS causes far more trouble: an IT technician now has to repair
         | the machine, when otherwise the fix would just be updating the
         | faulty driver... Also, does your car not start if you are
         | missing water for the wiper?
        
           | jve wrote:
           | Water for the wiper is a userland feature.
           | 
           | A 3rd party hooking into the kernel is the 3rd party's
           | responsibility. It is like equipping your car with LPG - THAT
           | hooks into the engine (kernel). And when I had a faulty gas
           | pressure sensor, my car actually halted (BSOD if you will)
           | instead of automatically failing over to gasoline as it should
           | by design.
           | 
           | You can argue that the car had no means to continue execution
           | but the kernel has; however, invalid kernel state can cause
           | more corruption down the road. Or, as the parent points out,
           | deliver lethal doses of something.
        
             | pinebox wrote:
             | Initially I was inclined to disagree ("these things should
             | always fail safe") however with more and more stuff being
             | pushed into the kernel it's hard to say that you're wrong
             | or exactly where a line needs to be drawn between
             | "minimally functional system" and "dangerously out of
             | control system".
             | 
             | I think until we discover a technology that forces
             | commercial software vendors to employ functioning QA
             | departments none of this will really solve anything.
        
         | ChrisMarshallNY wrote:
         | _> Ignoring the failure is perverse._
         | 
         | If the failed system is a security module, I think that's
         | absolutely correct. If the system runs, without the security
         | module, well, that's like forgetting to pack condoms on Shore
         | Leave. You'll likely be bringing something back to the ship
         | with you.
         | 
         |  _Someone_ needs to be testing the module, and the enclosing
         | system, to make sure it doesn't cause problems.
         | 
         | I suspect that it got a great deal of automated unit testing,
         | but maybe not so much fuzz and monkey (especially "Chaos
         | Monkey"-style) testing.
         | 
         | It's a fuzzy, monkey-filled world out there...
        
           | kayo_20211030 wrote:
           | Interesting analogy, but yes. If the module *is* necessary,
           | well, it's necessary and nothing should work without it.
           | Testing must have been a mess here.
        
         | __MatrixMan__ wrote:
         | I like how Unison works for this reason. You call functions by
         | cryptographic hash, so you have some assurance that you're
         | calling the same function you called yesterday.
         | 
         | Updates would require the caller to call different functions
         | which means putting the responsibility in the hands of the
         | caller, where it should be, instead of on whoever has a side
         | channel to tamper with the kernel.
         | 
         | You end up with the work-perfectly-or-not-at-all behavior that
         | you're after because if the function that goes with the
         | indicated hash is not present, you can't call it, and if it is
         | present you can't call it in any way besides how it was
         | intended
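         | 
         | A toy Python version of the idea, to show the shape of it. As
         | far as I understand, Unison hashes the syntax tree rather than
         | the source text; this sketch hashes the text, and the Codebase
         | class and its methods are invented for the example.

```python
import hashlib
from typing import Callable, Dict


class Codebase:
    """Toy content-addressed store: definitions are looked up by hash."""

    def __init__(self) -> None:
        self._defs: Dict[str, Callable] = {}

    def add(self, source: str, name: str) -> str:
        """Compile `source`, store the function `name`, return its hash."""
        digest = hashlib.sha256(source.encode()).hexdigest()
        namespace: dict = {}
        exec(source, namespace)
        self._defs[digest] = namespace[name]
        return digest

    def call(self, digest: str, *args):
        """Call exactly the definition with this hash, or fail loudly."""
        if digest not in self._defs:
            raise LookupError(f"no definition with hash {digest[:12]}")
        return self._defs[digest](*args)
```

         | An "update" produces a new hash, so existing callers keep
         | getting the old definition until they choose to switch.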
        
         | enragedcacti wrote:
         | I agree that some system components should be treated as
         | critical no matter what, but the software at issue in this case
         | (Falcon Sensor or Antivirus more generally) is precautionary
         | and only best effort anyways. I would wager the vast majority
         | of the orgs affected on Friday would have preferred the
         | marginally increased risk of a malware attack or unauthorized
         | use over a 24 hour period instead of the total IT collapse they
         | experienced. Further, there's no reason the bug HAD to cause a
         | BSOD, it's possible the systems could have kept on trucking but
         | with an undefined state and limitless consequences. At least
         | with eBPF you get to detect a subset of possible errors and
         | make a risk management decision based on the result.
        
           | kayo_20211030 wrote:
           | I'm with you. What's critical, and what's not? Is it a big
           | thing, or not a big thing? Is this particular machine more
           | critical than the one over there? Security systems need to be
           | at the lowest level, or else some shifty bastard will find a
           | path around them. If it's at the lowest level, the downside
           | of a failure is catastrophic, as we experienced last Friday.
           | The carnage here is ultimately on CrowdStrike. The testing
           | must have been slapdash at best, and missing at worst. eBPF
           | changes nothing. The question is: should we fail, or carry
           | on? eBPF doesn't help with that decision, it only determines
           | the outcome from a system perspective. Any decision is a
           | value judgement; it might be right or wrong, and its outcome
           | either benign or deadly. Choices!
        
         | emn13 wrote:
         | The system clearly already behaves that way (i.e. ignores
         | failure) - after all, the fix was to simply delete the
         | offending file. If that's an option, then loader can do that
         | too. It can and perhaps even is smarter, such as "fallback onto
         | previous version".
         | 
         | Furthermore, the reaction to a malformed state need not be
         | "ignore". It could disable restricted user login; or turn off
         | the screen.
         | 
         | If the worry is that this is liable to abuse by malware, well,
         | if the malware can already rewrite the on-disk files for the
         | AV, I wonder whether it's really a good idea to trust the
         | system itself to be able to deal with that. It'd probably be
         | safer to just report that up the security foodchain, and
         | potentially let some external system take measures such as
         | disabling or restricting network access. Better yet, such
         | measures don't even require the same capabilities to intervene
         | in the system, merely to observe - which makes the AV system
         | less likely to serve as a malware vector itself or to cause
         | bugs like this.
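         | 
         | A loader that falls back onto the previous version might be
         | sketched like this in Python. The function names are invented,
         | and the validity check is only a placeholder (non-empty and not
         | all NUL bytes, going by the "file full of nulls" reports
         | elsewhere in the thread); a real format would carry a header,
         | checksum or signature to verify.

```python
from pathlib import Path
from typing import Optional


def looks_valid(data: bytes) -> bool:
    """Cheap sanity check: non-empty and not made up entirely of NUL bytes."""
    return len(data) > 0 and any(data)


def pick_content_file(new: Path, previous: Path) -> Optional[Path]:
    """Prefer the new file, fall back to the previous one, else return None."""
    for candidate in (new, previous):
        try:
            if looks_valid(candidate.read_bytes()):
                return candidate
        except OSError:
            continue    # missing or unreadable: try the next candidate
    return None
```

         | A None result is the point where the "report it upstream"
         | option would kick in instead of crashing the machine.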
        
       | shrx wrote:
       | From the article:
       | 
       | > If the verifier finds any unsafe code, the program is rejected
       | and not executed. The verifier is rigorous -- the Linux
       | implementation has over 20,000 lines of code [0] -- with
       | contributions from industry (e.g., Meta, Isovalent, Google) and
       | academia (e.g., Rutgers University, University of Washington).
       | 
       | [0] links to
       | https://github.com/torvalds/linux/blob/master/kernel/bpf/ver...
       | which has this interesting comment at the top:
       | 
       | /* bpf_check() is a static code analyzer that walks eBPF program
       |  * instruction by instruction and updates register/stack state.
       |  * All paths of conditional branches are analyzed until 'bpf_exit'
       |  * insn.
       |  *
       |  * The first pass is depth-first-search to check that the program
       |  * is a DAG.
       |  * It rejects the following programs:
       |  * - larger than BPF_MAXINSNS insns
       |  * - if loop is present (detected via back-edge)
       |  ...
       | 
       | I haven't inspected the code, but I thought that checking for
       | infinite loops would imply solving the halting problem. Where's
       | the catch?
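       | 
       | For what it's worth, the first pass described in that comment
       | can be mimicked in a few lines of Python. This is a toy, not the
       | kernel's code: the control-flow graph is a plain dict, the
       | function name is made up, and MAX_INSNS is a stand-in for
       | BPF_MAXINSNS. It models only the two rejection rules quoted.

```python
from typing import Dict, List

MAX_INSNS = 4096  # stand-in for BPF_MAXINSNS; treat the value as an assumption


def first_pass_ok(successors: Dict[int, List[int]], entry: int = 0) -> bool:
    """Toy first pass: accept only small programs whose control flow is a DAG.

    `successors` maps an instruction index to the indices it may jump or fall
    through to. A back-edge (an edge to a node still on the DFS stack) means
    a loop, and the program is rejected without asking whether it terminates.
    """
    if len(successors) > MAX_INSNS:
        return False
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in successors}
    colour[entry] = GREY
    stack = [(entry, iter(successors.get(entry, [])))]
    while stack:
        node, edges = stack[-1]
        for nxt in edges:
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return False            # back-edge: loop detected
            if state == WHITE:
                colour[nxt] = GREY
                stack.append((nxt, iter(successors.get(nxt, []))))
                break
        else:
            colour[node] = BLACK        # all successors explored
            stack.pop()
    return True
```

       | Note that a loop which obviously terminates is rejected just
       | like an infinite one: the check never decides halting, it only
       | looks at the shape of the graph.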
        
         | dtx1 wrote:
         | I have no insight into this particular project but you could
         | work around the halting problem by only allowing loops you can
         | prove will not go infinite. That would of course imply
         | rejecting loops that won't go infinite but can't be proven not
         | to.
        
         | hiddencost wrote:
         | Unterminated loops might be a better phrasing.
        
         | efee22 wrote:
         | Infinite loops are not possible and would get rejected by the
         | verifier since it cannot solve the halting problem. Here is a
         | good overview on the options available:
         | https://ebpf-docs.dylanreimerink.nl/linux/concepts/loops/
        
         | skywhopper wrote:
         | If the verifier can't determine that the loop will halt, the
         | program is disallowed. Also, if the program gets passed and
         | then runs too long anyway, it's force-halted. So... I guess
         | that solves the halting problem.
        
           | neaanopri wrote:
           | It's more accurate to say that in principle, there could be
           | programs that would halt, but that the verifier will deny.
        
           | lucianbr wrote:
           | So this "solves" the halting problem by creating a new class
           | "might-not-halt-but-not-sure" and lumping it with "does-not-
           | halt". I find it hard to believe the new class is small
           | enough for this to be useful, in the sense that it will avoid
           | all kernel crashes.
           | 
           | I rather expect useful or needed code would be rejected due
           | to "not-sure-it-halts", and then people will use some kind of
           | exception or not use the verifier at all, and then we are
           | back to square one.
        
             | umanwizard wrote:
             | Well it is useful in practice, there are some pretty useful
             | products based on eBPF on Linux, most notably Cilium (and,
             | shameless plug for the one I'm working on: Parca, an eBPF-
             | based CPU profiler).
        
               | lucianbr wrote:
               | Bad wording on my part, and I still don't know how to
               | word it better. I'm sure this thing is useful, I don't
               | think everyone who contributed code was just clueless.
               | 
               | However, the claim "in the future, computers will not
               | crash due to bad software updates, even those updates
               | that involve kernel code" must be false. There is no way
               | it is true. Whatever Cilium is, I cannot believe it
               | generally prevents kernel crashes.
        
               | umanwizard wrote:
               | Correct, you will never be able to write any possible
               | arbitrary code and have it run in eBPF. It necessarily
               | constrains the class of programs you can write. But the
               | constrained set is still quite useful and probably
               | includes the crowdstrike agent.
               | 
               | Also, although this isn't the case now, it's possible to
               | imagine that the verifier could be relaxed to allow a
               | Turing-complete subset of C that supports infinite loops
               | while still rejecting sources of UB/crashes like
               | dereferencing an invalid pointer. I suspect from reading
               | this post that that is the future Mr. Gregg has in mind.
               | 
               | > Whatever Cilium is, I cannot believe it generally
               | prevents kernel crashes.
               | 
               | It doesn't magically prevent all kernel crashes from
               | unrelated code. But what we can say is that Cilium itself
               | can't crash the kernel unless there are bugs in the eBPF
               | verifier.
        
               | lucianbr wrote:
               | If the verifier allowed a Turing-complete language, it
               | would solve the halting problem, which is impossible.
        
               | umanwizard wrote:
               | My point is that the verifier could be relaxed to accept
               | programs that never halt, thus not needing to solve the
               | halting problem. You could then have the kernel just kill
               | it after running over a certain maximum amount of time.
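               | 
               | A minimal sketch of that "kill it after a budget" idea,
               | using a toy instruction set rather than real eBPF. The
               | names `run_with_budget` and `BudgetExceeded` are
               | hypothetical, and this is not how the verifier works
               | today.

```python
class BudgetExceeded(Exception):
    """Raised when a program uses up its step budget."""


def run_with_budget(program, max_steps):
    """Run a toy program made of ("add", n), ("jump", target) and ("exit",) tuples."""
    acc, pc, steps = 0, 0, 0
    while pc < len(program):
        if steps >= max_steps:
            # The host gives up instead of trying to prove termination.
            raise BudgetExceeded(f"killed after {steps} steps")
        steps += 1
        op = program[pc]
        if op[0] == "add":
            acc += op[1]
            pc += 1
        elif op[0] == "jump":
            pc = op[1]
        elif op[0] == "exit":
            return acc
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return acc
```

               | The runtime never decides whether the program halts; it
               | only bounds how long it is willing to wait, so a program
               | like `[("add", 1), ("jump", 0)]` is simply stopped.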
        
               | lucianbr wrote:
               | Why do you think the kernel crashes when crowdstrike
               | attempts to reference some unavailable address (or
               | whatever it does) instead of just denying that operation
               | and continuing on? That would be the solution using this
               | philosophy "just kill long running program". And no need
               | for eBPF or anything complicated. But it doesn't work
               | that way in practice.
               | 
               | This is just such a naive view. "We can prevent programs
               | from crashing by just taking care to stop them when they
               | do bad things". Well, sure, that's why you have a kernel
               | and userland. But it turns out, some things need to run
               | in the kernel. Or "just deny permission". Then it turns
               | out some programs need to run as admin. And so on.
               | 
               | There is a generality in the halting problem, and saying
               | "we'll just kill long running programs" just misses the
               | point entirely.
               | 
               | Likely what will happen is that you will kill useful
               | long-running programs, then an exception mechanism will
               | be invented so some programs will not be killed, because
               | they need to run longer, then one of those programs will
               | go into an infinite loop despite all your mechanisms
               | preventing it. Just like the crowdstrike driver managed
               | to bring down the OS despite all the work that is
               | supposed to prevent the entire computer crashing if a
               | single program tries something stupid.
        
               | umanwizard wrote:
               | > Why do you think the kernel crashes when crowdstrike
               | attempts to reference some unavailable address (or
               | whatever it does) instead of just denying that operation
               | and continuing on?
               | 
               | Linux and windows are completely monolithic kernels; the
               | crowdstrike agent isn't running in a sandbox and has
               | complete unfettered access to the entire kernel address
               | space. There is no separate "the kernel" to detect when
               | the agent does something wrong; once a kernel module is
               | loaded, IT IS the kernel.
               | 
               | Lots of people have indeed realized this is undesirable
               | and that there should be a sandboxed way to run kernel
               | code such that bugs in it can't cause arbitrarily bad
               | undefined behavior. Thus they invented eBPF. That's
               | precisely what eBPF is.
               | 
               | I don't know whether it's literally true that someday you
               | will be able to write all possibly useful kernel-mode
               | code in eBPF. But the spirit of the claim is true:
               | there's a huge amount of useful software that could be
               | written in eBPF today on Linux instead of as kernel
               | modules, and this includes crowdstrike. Thus Windows
               | supporting eBPF, and crowdstrike choosing to use it,
               | would have solved this problem. That set of software will
               | increase as the eBPF verifier is enhanced to accept a
               | wider variety of programs.
               | 
               | Just like you can write pretty much any useful program in
               | JavaScript today -- a sandboxed language.
               | 
               | You're also correct that due to the halting problem,
               | we'll either have to accept that eBPF will never be
               | Turing complete, OR accept that some eBPF programs will
               | never halt and deal with the issues in other ways. Just
               | like Chrome's JavaScript engine has to do. I don't really
               | view this as a fundamentally unsolvable issue with the
               | nature of eBPF.
        
               | tptacek wrote:
               | The claim isn't that eBPF generally prevents kernel
               | crashes. It's that it prevents crashes in the subset of
               | programs it's designed for, in particular for
               | instrumentation, which Crowdstrike is (in this author's
               | conception) an instance of.
        
               | lucianbr wrote:
               | I have quoted the claim verbatim from the article. It is
               | obviously the claim of the article.
        
               | tptacek wrote:
               | It's referring to _Windows security software_. If you
               | have a lot of context with eBPF, which Gregg obviously
               | does, the notion that eBPF will subsume the entire kernel
               | doesn't even need to be said: you can't express
               | arbitrary programs in eBPF. eBPF is safe because the
               | verifier rejects the vast majority of valid programs.
        
             | tptacek wrote:
             | Lots of useful code is rejected due to "not-sure-it-halts".
             | That's the premise.
        
         | pkhuong wrote:
         | The basic logic flags _any_ loop ("back-edge").
        
           | rezonant wrote:
           | This, others have said it less concisely, but a program
           | without loops and arbitrary jumps is guaranteed to halt if we
           | assume the external functions it calls into will halt.
        
         | atrus wrote:
         | The halting problem is exhaustive, there isn't an algorithm
         | that is valid for all programs. You can still check for some
         | kinds of infinite loops though!
        
           | roywiggins wrote:
           | More specifically, you can accept a set of programs that you
           | are certain do halt, and reject all others, at the expense of
           | rejecting some that will halt. As long as that set is large
           | enough to be practical, the result can be useful. If you eg
           | forbid code paths that jump "backwards", you can't really
           | loop at all. Or require loops to be bounded by constants.
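           | 
           | A small sketch of that trade-off, assuming a toy
           | instruction set of ("nop",), ("jump", target) and
           | ("exit",); both function names are made up for
           | illustration.

```python
def only_forward_jumps(program):
    """Accept a program only if every jump target lies strictly after the jump."""
    for pc, op in enumerate(program):
        if op[0] == "jump" and op[1] <= pc:
            return False
    return True


def count_steps(program, limit=1000):
    """Run the toy program; return the number of steps taken, or None if over `limit`."""
    pc, steps = 0, 0
    while pc < len(program):
        if steps >= limit:
            return None
        steps += 1
        op = program[pc]
        if op[0] == "exit":
            break
        pc = op[1] if op[0] == "jump" else pc + 1
    return steps


accepted = [("jump", 2), ("nop",), ("exit",)]
halts_but_rejected = [("jump", 2), ("exit",), ("jump", 1)]  # 0 -> 2 -> 1 -> exit
spins_forever = [("jump", 0)]
```

           | An accepted program can never run more steps than it has
           | instructions. `halts_but_rejected` is the cost: it
           | terminates after three steps, but the checker cannot tell
           | it apart from `spins_forever` and refuses both.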
        
         | aksdlf wrote:
         | I'm glad to hear that Meta and Google code is "rigorous". I'd
         | prefer INRIA, universities that fund theorem provers,
         | industries where correctness matters like aerospace or
         | semiconductors.
        
           | chc4 wrote:
           | Windows doesn't use the Linux eBPF verifier, they have their
           | own implementation named PREVAIL[0] that is based on an
           | abstract interpretation model that has formal small step
           | semantics. The actual implementation isn't formally proven,
           | however.
           | 
           | 0: https://github.com/vbpf/ebpf-verifier
        
           | auspiv wrote:
           | Correctness as defined by Boeing? Or another definition?
           | 
           | "The Maneuvering Characteristics Augmentation System (MCAS)
           | is a flight stabilizing [software] feature developed by
           | Boeing that became notorious for its role in two fatal
           | accidents of the 737 MAX in 2018 and 2019, which killed all
           | 346 passengers and crew among both flights."
           | 
           | https://en.wikipedia.org/wiki/Maneuvering_Characteristics_Au.
           | ..
           | 
           | "The Boeing Orbital Flight Test (OFT) was an uncrewed orbital
           | flight test launched on December 20, 2019, but after
           | deployment, an [incorrect] 11-hour offset in the mission
           | clock of Starliner caused the spacecraft to compute that "it
           | was in an orbital insertion burn", when it was not. This
           | caused the attitude control thrusters to consume more fuel
           | than planned, precluding a docking with the International
           | Space Station.[79][80]"
           | 
           | [79] https://spacenews.com/starliner-suffers-off-nominal-
           | orbital-... "Starliner suffers "off-nominal" orbital
           | insertion after launch". SpaceNews. December 20, 2019.
           | Archived from the original on June 6, 2024. Retrieved
           | December 20, 2019.
           | 
           | [80] https://www.cnbc.com/2019/12/20/boeings-starliner-flies-
           | into... Sheetz, Michael (December 20, 2019). "Boeing
           | Starliner fails mission, can't reach space station after
           | flying into wrong orbit". CNBC. Archived from the original on
           | February 8, 2021. Retrieved December 20, 2019.
        
           | SoftTalker wrote:
           | Also that lines of code is a proxy for rigor, something new I
           | learned today. /s
        
             | sunnyps wrote:
             | I think they mean that the code base is small enough to be
             | audited thoroughly. Maybe they should reword it to be
             | clearer.
        
         | umanwizard wrote:
         | eBPF is not Turing complete. Writing it is very annoying
         | compared to writing normal C code for exactly this reason.
        
         | Retr0id wrote:
         | The halting problem cannot be solved in the general case, but
         | in many cases you _can_ prove that a program halts. eBPF only
         | allows verifiably-halting programs to run.
        
         | lolinder wrote:
         | I'm not able to comment on what this code is doing, but as for
         | the theory:
         | 
         | The halting problem is only unsolvable in the general case. You
         | cannot prove that any arbitrary piece of code will stop, but
         | you can prove that specific types of code will stop and reject
         | anything that you're unable to prove. The trivial case is "no
         | jumps"--if your code executes strictly linearly and is itself
         | finite then you know it will terminate. More advanced cases can
         | also be proven, like a loop over a very specific bound, as long
         | as you can place constraints on how the code can be structured.
         | 
         | As an example, take a look at Dafny, which places a lot of
         | restrictions on loops [0], only allowing the subset that it can
         | effectively analyze.
         | 
         | [0]
         | https://ece.uwaterloo.ca/~agurfink/stqam/rise4fun-Dafny/#h25
        
           | jkrejcha wrote:
           | Adding on (and it's not terribly relevant to eBPF), it's also
           | worth noting that there are trivial programs you can prove
           | DON'T halt.
           | 
           | A trivial example[1]:
           | 
           |     int main() {
           |         while (true) {}
           |         int x = foo();
           |         return x;
           |     }
           | 
           | This program trivially runs forever[2], and indeed many
           | static code analyzers will point out that everything after
           | the `while (true) {}` line is unreachable.
           | 
           | I feel like the halting problem is incredibly widely
           | misunderstood to be about "ANY program" when it really talks
           | about "ALL programs".
           | 
           | [1]: In _C++_, this is undefined behavior technically, but C
           | and most other programming languages define the behavior of
           | this (or equivalent) function.
           | 
           | [2]: Fun relevant xkcd: https://xkcd.com/1266/
        
             | fwip wrote:
             | EDIT: I am incorrect, please ignore. (Original text below,
             | for posterity).
             | 
             | Nit: In many languages, doesn't this depend on what foo()
             | does? e.g:
             | 
             |     foo() {
             |         exit(0);
             |     }
        
               | loeg wrote:
               | No? The foo() invocation is never reached because the
               | while loop never terminates.
        
               | fwip wrote:
               | Apologies; I misread the function call as being inside
               | the loop.
        
         | dathinab wrote:
         | the halting problem is only true for _arbitrary_ programs
         | 
         | but there are always sets of programs for which it is clearly
         | possible to guarantee their termination
         | 
         | e.g. the program `return 1+1;` is guaranteed to halt
         | 
         | e.g. a program like `while condition(&mut state) { ... }`,
         | where `condition()` is guaranteed to halt but is otherwise
         | unknown, is not guaranteed to halt, but if you turn it into `for
         | _ in 0..1000 { if !condition(&mut state) { break; } ... }` then
         | it is guaranteed to halt after at most 1000 iterations
         | 
         | or in other words eBPF only accepts programs which it can prove
         | will halt in at most maxins "instructions" (though it's more
         | strict than my example, i.e. you would need to unroll the
         | for-loop to make it pass validation)
         | 
         | the thing with programs which are provable halting is that they
         | tend to also not be very convenient to write and/or quite
         | limited in what you can do with them, i.e. they are not
         | suitable as general purpose programming languages at all
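         | 
         | The bounded-loop rewrite above, as a runnable sketch. The
         | helper name `run_bounded` and the dict used as state are
         | assumptions for illustration only.

```python
def run_bounded(condition, body, state, max_iterations=1000):
    """Run `body` while `condition` holds, but never more than `max_iterations` times.

    Returns the number of times `body` ran.
    """
    for iterations in range(max_iterations):
        if not condition(state):
            return iterations
        body(state)
    return max_iterations


def countdown(state):
    state["n"] -= 1


def still_positive(state):
    return state["n"] > 0


def always_true(state):
    return True
```

         | As long as `condition` and `body` themselves halt, the loop
         | runs at most `max_iterations` times whatever `condition`
         | returns, so termination no longer depends on the data.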
        
         | red_admiral wrote:
         | eBPF is not Turing-complete, I suppose.
        
           | lizxrice wrote:
           | In this talk we demo Conway's Game of Life implemented in
           | eBPF: https://www.youtube.com/watch?v=tClsqnZMN6I
        
             | lizxrice wrote:
             | I should clarify that individual eBPF programs have to
             | terminate, but more complex problems can be solved with
             | multiple eBPF programs, and can be "scheduled" indefinitely
             | using BPF timers
        
           | javierhonduco wrote:
           | It is not, programs that are accepted are proved to
           | terminate. Large and more complex programs are accepted by
           | BPF as of now, which might give the impression that it's now
           | Turing complete, when it is definitely not the case.
        
       | skywhopper wrote:
       | The implicit assumption of the article is that eBPF code can't
       | crash a kernel, but the article itself eventually admits that it
       | can and has done, including last month. eBPF is a safer way of
       | providing kernel-extension functionality, for sure, but
       | presenting it as the perfect solution is just asking to have your
       | argument dismissed. eBPF is not perfect. And there's plenty of
       | things it can't do. The very sandbox rules that limit how long
       | its programs may run and what they can do also make it entirely
       | inappropriate for certain tasks. Let's please stop pretending
       | there's a silver bullet.
        
         | efee22 wrote:
         | It's not a silver bullet, however, it is still better to
         | push all the panicable bugs into one community-maintained
         | section (e.g. eBPF verifier). All vendors have an incentive to
         | help get it right and this is much better than every vendor
         | shipping their own panicable bugs in their own out of tree
         | kernel modules. Additionally, it's not just the industry
         | looking at eBPF, but also academia in terms of formally
         | verifying these critical sections.
        
           | lucianbr wrote:
           | "Improves kernel stability" is great. "Prevents kernel
           | crashes" is a plain lie.
           | 
           | > In the future, computers will not crash due to bad software
           | updates, even those updates that involve kernel code.
           | 
           | Come on. Computers will continue to crash in the future, even
           | when using eBPF. I am quite certain.
        
         | lucianbr wrote:
         | It's casually claiming to have solved the halting problem, at
         | least within some limited but useful context. That should be
         | impossible, and it turns out, it is.
         | 
         | I expect it can be solved within some limited contexts, but
         | those contexts are not useful, at least not at the level of
         | "generic kernel code".
        
           | red_admiral wrote:
           | It solves the halting problem by not being Turing complete. I
           | presume each eBPF runs in a context with bounded memory,
           | requested up front, for one thing; it also disallows jumps
           | unless you can prove the code still halts.
        
           | michaelt wrote:
           | eBPF started out as Berkeley Packet Filters. People wanted to
           | be able to set up complex packet filters. Things like 'udp
           | and src host 192.168.0.3 and udp[4:2]=0x0034 and
           | udp[8:2]=0x0000 and udp[12]=0x01 and udp[18:2]=0x0001 and not
           | src port 3956'
           | 
           | So BPF introduced a very limited bytecode, which is complex
           | enough that it can express long filters with lots of
           | and/or/brackets - but which is limited enough it's easy to
           | check the program terminates and is crash-free. It's still
           | quite limited - prior to ~2019, all loops had to be fully
           | unrolled at compile time as the checker didn't support loops.
           | 
           | It turned out that, although limited, this worked pretty well
           | for filtering packets - so later, when people wanted a way to
           | filter all system calls they realised they could extend the
           | battle-tested BPF system.
           | 
           | Nobody is claiming to have solved the halting problem.
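           | 
           | To make the filter example concrete, here is a simplified
           | sketch of part of that expression as straight-line checks
           | over a UDP segment. The helper names, the shortened filter
           | and the destination port are illustrative assumptions, not
           | real BPF bytecode.

```python
import struct


def load(segment, offset, size):
    """Read `size` bytes at `offset` as a big-endian integer, like udp[offset:size]."""
    return int.from_bytes(segment[offset:offset + size], "big")


def matches(udp_segment):
    """Roughly: udp[4:2]=0x0034 and udp[8:2]=0x0000 and not src port 3956."""
    if len(udp_segment) < 10:               # too short to hold the bytes we inspect
        return False
    if load(udp_segment, 4, 2) != 0x0034:   # UDP length field
        return False
    if load(udp_segment, 8, 2) != 0x0000:   # first two payload bytes
        return False
    if load(udp_segment, 0, 2) == 3956:     # source port
        return False
    return True


def make_segment(src_port, payload):
    """Build a UDP header (checksum left as 0) followed by `payload`."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, 12345, length, 0) + payload
```

           | Every check either rejects or falls through to the next
           | one, so control only ever moves forward. That shape is why
           | termination is easy to establish for filters like this.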
        
             | lucianbr wrote:
             | Did you read the article? It says computers will not crash
             | in the future due to updates. It literally says that in the
             | very first line of the article.
             | 
             | > In the future, computers will not crash due to bad
             | software updates, even those updates that involve kernel
             | code. In the future, these updates will push eBPF code.
             | 
             | What you are claiming is completely different. A kind of
             | "firewall" for syscalls. But updates to drivers and
             | software must contain code and data. The author is not
             | talking about updates to the firewall between drivers and
             | the kernel, they talk about updating drivers themselves. It
             | literally says "updates that involve kernel code". Will the
             | kernel only consist of eBPF filtering bytecode? How could
             | that possibly work?
        
       | vfclists wrote:
       | Yep, another fix to all our problems, a new bandwagon to be
       | jumped on by all EDR vendors, until ...
       | 
       | Here I am using the term "EDR". Until this CrowdStrike debacle
       | I'd never heard it.
       | 
       | Only tells how seriously you should take my opinions.
        
       | blinkingled wrote:
       | Ok. But the good old push code to staging / canary it before
       | mainstream updates was a simpler way of solving the same problem.
       | 
       | Crowdstrike knows the computers they're running on, it is trivial
       | to implement a system where only few designated computers
       | download and install the update and report metrics before the
       | update controller decides to push it to next set.
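       | 
       | A rough sketch of such an update controller. The function
       | name, the host names and the health-check callback are all
       | hypothetical; a real system would also need timeouts and
       | rollback.

```python
def staged_rollout(stages, install_and_check, max_failure_rate=0.0):
    """Push an update stage by stage and stop once a stage looks unhealthy.

    `stages` is a list of host lists, canaries first. `install_and_check(host)`
    installs the update on one host and returns True if the host stays healthy.
    Returns the list of hosts that received the update.
    """
    updated = []
    for hosts in stages:
        failures = 0
        for host in hosts:
            updated.append(host)
            if not install_and_check(host):
                failures += 1
        if hosts and failures / len(hosts) > max_failure_rate:
            break  # do not push to the next, larger stage
    return updated


stages = [["canary-1", "canary-2"], [f"prod-{i}" for i in range(100)]]
```

       | With an update that breaks every host, only the two canaries
       | are touched; the hundred production hosts never see it.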
        
         | Archelaos wrote:
         | It would mitigate the problem, but not solve it. You can still
         | imagine a condition that only occurs after the update has been
         | rolled out everywhere. Furthermore, such a bug would still be
         | extremely problematic for the concerned customers, even if not
         | all of them were affected. In addition, it would be necessary
         | to react very quickly in the case of zero-day vulnerabilities.
        
           | tantalor wrote:
           | (semantic argument warning)
           | 
           | "Mitigation" is dealing with an outage/breakage after it
           | occurs, to reduce the impact or get system healthy again.
           | 
           | You're talking about "prevention" which keeps it from
           | happening at all.
           | 
           | Canarying is generic approach to prevention, and should not
           | be skipped.
           | 
           | Avoiding the risk entirely (eBPF) would also help prevent
           | outage, but I think we're deluding ourselves to say it
           | "solves" the problem once and for all; systems will still go
           | down due to bad deploys.
        
           | blinkingled wrote:
           | Yes, I am not arguing against having the ability to deal with
           | it quickly - I am saying canary/ staging helps you do exactly
           | that. Because as we see in the case of Intel CPUs and
           | Crowdstrike, some problems, or the scale of some problems,
           | are best prevented.
        
         | phartenfeller wrote:
         | Why trust somebody else not messing up? With that in place for
         | windows and crowdstrike billions of dollars would be saved and
         | many lives not negatively impacted ...
        
       | mrpippy wrote:
       | > Once Microsoft's eBPF support for Windows becomes production-
       | ready, Windows security software can be ported to eBPF as well.
       | 
       | This doesn't seem grounded in reality. If you follow the link to
       | the "hooks" that Windows eBPF makes available [1], it's just for
       | incoming packets and socket operations. IOW, MS is expecting you
       | to use the Berkeley Packet Filter for packet filtering. Not for
       | filtering I/O, or object creation/use, or any of the other
       | million places a driver like Crowdstrike's hooks into the NT
       | kernel.
       | 
       | In addition, they need to be in the kernel in order to monitor
       | all the other 3rd party garbage running in kernel-space. ELAM
       | (early-launch anti-malware) loads anti-malware drivers first so
       | they can monitor everything that other drivers do. I highly doubt
       | this is available to eBPF.
       | 
       | If Microsoft intends eBPF to be used to replace kernel-space
       | anti-malware drivers, they have a long, long way to go.
       | 
       | [1]:
       | https://microsoft.github.io/ebpf-for-windows/ebpf__structs_8...
        
         | shahahqq wrote:
         | I hope though that Microsoft will double down on their eBPF
         | support for Windows after this incident.
        
           | stackskipton wrote:
           | Doubt it. Microsoft is clearly over Windows. They continue to
           | produce it but every release feels like "Ugh, fine, since you
           | are paying me a ton of money."
           | 
           | Internally, Microsoft is running more and more workloads on
           | Linux and externally, I've had .Net team tell me more than
           | once that Linux is preferred environment for .Net. SQL Server
           | team continues to push hard for Linux compatibility with
           | every release.
           | 
           | EDIT: Windows Desktop gets more love because they clearly see
           | that as important market. I'm talking more Windows Server.
        
             | kevincox wrote:
             | They aren't over windows. They continue to be incredibly
             | interested in and actively developing how much money they
             | can suck from their users. Especially via various forms of
             | ads.
             | 
             | But yeah, kernel features are few and far between.
        
               | rob74 wrote:
               | See also: https://en.wikipedia.org/wiki/Cash_cow
        
               | queuebert wrote:
               | I believe the term you are looking for is "rent seeking".
               | Other than visual changes, what new functionality does
               | Windows 11 actually have that Windows XP didn't have?
               | (I'm being generous with XP, because actually 95 was
               | already mostly internet ready.) Yet how many times have
               | many of us paid for a Windows license on a new computer
               | or because the old version stopped getting updates?
        
               | pcwalton wrote:
               | > Other than visual changes, what new functionality does
               | Windows 11 actually have that Windows XP didn't have?
               | 
               | Off the top of my head, limiting myself to just NT kernel
               | stuff: WSL and Hyper-V, pseudo-terminals, condvars, WDDM,
               | DWM, elevated privilege programs on the same desktop,
               | font driver isolation, and limiting access to win32k for
               | sandboxing.
        
               | recursive wrote:
               | > what new functionality does Windows 11 actually have
               | that Windows XP didn't have?
               | 
               | Off the top of my head, built-in bluetooth support, an
               | OS-level volume mixer, and more support for a wider
               | variety of class-compliant devices. I'm sure there are a
               | lot more, and if you actually care about the answer, I
               | don't think it would be hard to find.
        
               | queuebert wrote:
               | All of this could've been added to XP, right?
        
               | recursive wrote:
               | I don't know.
               | 
               | If it could, then XP would just be Windows 11. What's the
               | objection here?
        
               | vitus wrote:
               | > Other than visual changes, what new functionality does
               | Windows 11 actually have that Windows XP didn't have?
               | 
               | Modern crypto ciphersuites that aren't utterly broken?
               | Your best options for symmetric crypto with XP are 3DES
               | (officially retired by NIST as of this year) and RC4
               | (prohibited in TLS as of RFC 7465).
               | 
               | (And if you think 3DES isn't totally broken by itself,
               | you're right... except for the part where the ciphersuite
               | in question is in CBC mode and is vulnerable to BEAST.
               | Thanks, mandated ciphersuites.)
        
               | wolrah wrote:
               | > Other than visual changes, what new functionality does
               | Windows 11 actually have that Windows XP didn't have?
               | 
               | XP->Vista alone brought a bunch of huge changes that
               | massively improved security (UAC), capability (64 bit
               | desktops), and future-proofing (UEFI) among many many
               | other things.
               | 
               | Some helpful Wikipedia editors have answered this
               | question in excessive detail, so I'm just going to link
               | those for more info. Also I'm going to start with what XP
               | changed from 2003 both because it makes a good comparison
               | and I'd argue 2000/NT 5.0 is the root of the modern
               | Windows era. Your next sentence after the quote implies
               | you probably won't have a problem with that.
               | 
               | * XP/2003:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_XP
               | 
               | * 2003R2:
               | https://en.wikipedia.org/wiki/Windows_Server_2003#Windows_Se...
               | 
               | * Vista:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_Vista
               | 
               | * 2008:
               | https://en.wikipedia.org/wiki/Windows_Server_2008#Features
               | 
               | * 7:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_7
               | 
               | * 2008R2:
               | https://en.wikipedia.org/wiki/Windows_Server_2008_R2#New_fea...
               | 
               | * 8:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_8
               | 
               | * 2012:
               | https://en.wikipedia.org/wiki/Windows_Server_2012#Features
               | 
               | * 8.1:
               | https://en.wikipedia.org/wiki/Windows_8.1#New_and_changed_fe...
               | 
               | * 2012R2:
               | https://en.wikipedia.org/wiki/Windows_Server_2012_R2#Feature...
               | 
               | * 10:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_10
               | 
               | * 2016:
               | https://en.wikipedia.org/wiki/Windows_Server_2016#Features
               | 
               | * 2019:
               | https://en.wikipedia.org/wiki/Windows_Server_2019#Features
               | 
               | * 2022:
               | https://en.wikipedia.org/wiki/Windows_Server_2022#Features
               | 
               | * 11:
               | https://en.wikipedia.org/wiki/Features_new_to_Windows_11
               | 
               | * 2025:
               | https://learn.microsoft.com/en-us/windows-server/get-started...
               | 
               | Obviously some of this will be "fluff" and that's up to
               | your own personal definitions, but to act like there
               | haven't been significant changes in every major revision
               | is just nonsense.
        
             | throwaway2037 wrote:
             | This claim about SQL Server: Is it due to disk access being
             | slower from NT kernel compared to Linux kernel?
        
               | stackskipton wrote:
               | It's just easier for everyone involved (outside Windows
               | GUI clicker admins) if it runs on Linux. Containerization
               | is easier, configuration is easier and the operating
               | system is much more robust.
        
               | marcosdumay wrote:
               | There's something very wrong with Windows disk access,
               | you can see it easily by trying to run a Windows desktop
               | with rotating disks.
               | 
               | But SQL Server is in the unique position of being able to
               | optimize Windows for their own needs. So they shouldn't
               | have this kind of problem.
        
               | riskable wrote:
               | I had read previously from an unverified SQL Server
               | engineer that the thing they wanted most (with Linux
               | support) was proper containerization (from a developer
               | perspective). Apparently containers on Windows just don't
               | cut it (which is why nobody uses them in production).
               | Take it with a grain of salt though.
               | 
               | I don't think they'd ever admit that filesystem
               | performance was an issue (though we all know it is; NTFS
               | is over 30 years old!).
        
               | shawnz wrote:
               | > though we all know it is; NTFS is over 30 years old!
               | 
               | ext2, which is forwards compatible with ext3 and ext4, is
               | slightly older than NTFS
        
             | mosburger wrote:
             | > SQL Server team continues to push hard for Linux
             | compatibility with every release.
             | 
             | It's kinda funny that the DB that was once a fork of Sybase
             | that was ported to Windows is trying to make its way back
             | to Unix.
        
           | benfortuna wrote:
           | Keep in mind they don't just allow any old code to execute in
           | the kernel.
           | 
           | They do have rigorous tests (WHQL), it's just Crowdstrike
           | decided that was too burdensome for their frequent updates,
           | and decided to inject code from config files (thus bypassing
           | the control).
           | 
           | The fault here is entirely with Crowdstrike.
        
             | capitainenemo wrote:
             | Is there any evidence that the config files had arbitrary
             | code in them? The only analysis I'd seen so far indicated a
             | parsing error loading a viral signature database that was
             | routinely updated, but in this case was full of garbage
             | data.
        
               | benfortuna wrote:
               | Perhaps not verified, but some smart people do have
               | convincing arguments:
               | 
               | https://youtu.be/wAzEJxOo1ts?si=UNNxAN27VV1E6mcP&t=505
        
               | capitainenemo wrote:
               | Any article/blog/text-that-can-be-read?
        
               | alecco wrote:
               | Don't bother. He just repeats a tweet describing a
               | null+offset dereference, plus the speculation that the
               | null was picked up from the sys file.
        
             | remram wrote:
             | How rigorous are the tests if faulty data can brick the
             | machine?
        
               | dwattttt wrote:
               | Not rigorous enough to have detected this flaw in the
               | kernel sensor, although effectively any bug in this
               | situation (an AV driver) can brick a machine. I imagine
               | WHQL isn't able to find every possible bug in a driver
               | you submit to them, they're not your QA team.
        
         | brendangregg wrote:
         | Yes, we know eBPF must attach to events equivalent to Linux's,
         | but given there are already many event sources and consumers in
         | Windows, the work is to make eBPF another consumer -- not to
         | invent instrumentation frameworks from scratch.
         | 
         | Just to use an analogy: Imagine people do their banking on
         | JavaScript websites with Google Chrome, but if they use
         | Microsoft Edge it says "JavaScript isn't supported, please
         | download and run this .EXE". I'm not sure we'd be asking "if"
         | Microsoft would support JavaScript (or eBPF), but "when."
        
           | surajrmal wrote:
           | This assumes eBPF becomes the standard. It's not clear
           | Microsoft wants that. They could create something else which
           | integrates with dot net and push for that instead.
           | 
           | Also this problem of too much software running in the kernel
           | in an unbounded manner has long existed. Why should Microsoft
           | suddenly invest in solving it on Windows?
        
             | brendangregg wrote:
             | Microsoft have been driving the work to make eBPF an IETF
             | industry standard.
        
               | riskable wrote:
               | ...just like they did with Kerberos! And just like with
               | Kerberos they'll define a standard _then refuse to follow
               | it_. Instead, they will implement subtle changes to the
               | Windows implementation that make solutions that use
               | Windows eBPF incompatible with anything else, making it
               | much more difficult to write software that works with all
               | platforms' eBPF (or even just its output).
               | 
               | Everything's gotta be different in Windows land.
               | Otherwise, migrating _off_ of Windows land would be too
               | easy!
               | 
               | In case you were wondering what Microsoft refused to
               | implement with its Kerberos implementation it's the DNS
               | records. Instead of following the standard (they wrote!)
               | they decided that all Windows clients will use AD's
               | Global Catalog to figure out which KDC to talk to (e.g.
               | which one is "local" or closest to the client). Since
               | nothing but Windows uses the Global Catalog they
               | effectively locked out other platforms from being able to
               | integrate with Windows Kerberos implementation _as
               | effectively_ (it'll still work, just extremely
               | inefficiently as the clients won't know which KDC is
               | local so you either have to hard-code them into the
               | krb5.conf on every single device/server/endpoint and hope
               | for the best or DNS-and-pray you don't get a Domain
               | Controller/KDC that's on an ISDN line in some other
               | country).
        
               | MawKKe wrote:
               | Embrace, extend, ...
        
               | jrockway wrote:
               | This doesn't really seem like their strategy anymore.
               | It's not like Edge directly interprets Typescript, for
               | example. While they embraced and extended Javascript, any
               | extinguishing seems to be on the technical merits rather
               | than corporate will.
               | 
               | In the case of security scanners that run in the kernel,
               | we learned this weekend that a market need exists. The
               | mainstream media blamed Crowdstrike's bugs on "Windows".
               | Microsoft would likely like to wash its hands of future
               | events of this class. Linux-like eBPF is a path forward
               | for them that allows people to run the software they want
               | (work-slowers like Crowdstrike) while isolating their
               | reputation from this software.
        
             | philistine wrote:
             | Apple took the lead on this front. It has closed easy
             | access to the kernel by apps, and made a list of APIs to
             | try and replace the lost functionality. Anyone maintaining
             | a kernel module on macOS is stuck in the past.
             | 
             | Of course, the target area of macOS is much smaller than
             | Windows, but it is absolutely possible to kick all code,
             | malware and parasitic security services alike, from
             | accessing the kernel.
             | 
             | The safest kernel is the one that cannot be touched at
             | runtime.
        
               | Xunjin wrote:
               | > The safest kernel is the one that cannot be touched at
               | runtime.
               | 
               | Can you expand what you mean here? Because depending on
               | the application you are running, you will at least need
               | to talk to some APIs to get privileged access?
        
               | odo1242 wrote:
               | Yeah, Apple doesn't allow any user code to run in kernel
               | mode without significant hoops (the kernel is code
               | signed) and tries to provide a user space API (e.g.
               | DriverKit) as an alternative for the missing
               | functionality.
               | 
               | Some things (FUSE) are still annoying though.
        
               | Agingcoder wrote:
               | Being allowed to talk to the kernel to get info and
               | running with the same privileges (basically being able
               | to read/write any memory) are different things.
        
               | nullindividual wrote:
               | I don't think Microsoft has a choice with regards to
               | kernel access. Hell, individuals currently use
               | undocumented NT APIs. I can't imagine what happens to
               | backwards compat if kernel access is closed.
               | 
               | Apple's closed ecosystem is entirely different. They'll
               | change architectures on a whim and users will go with the
               | flow (myself included).
        
               | becurious wrote:
               | But Apple doesn't have the industrial and commercial uses
               | that Linux and Windows have. Where you can't suddenly
               | switch out to a new architecture without massive amounts
               | of validation costs.
               | 
               | At my previous job they used to use Macs to control
               | scientific instrumentation that needed a data acquisition
               | card. Eventually most of the newer product lines moved
               | over to Windows but one that was used in a validated FDA
               | regulated environment stayed on the Mac. Over time
               | supporting that got harder and harder: they managed
               | through the PowerPC to Intel transition but eventually
               | the Macs with PCIe slots went away. I think they looked
               | at putting the PCIe card in a Thunderbolt enclosure. But
               | the bigger problem is guaranteeing supply of a specific
               | computer for a reasonable amount of time. Very difficult
               | to do these days with Macs.
        
               | nullindividual wrote:
               | > validated FDA regulated environment stayed on the Mac
               | 
               | Given how long it takes to validate in a GxP environment,
               | and the cost, this makes sense.
        
               | adolph wrote:
               | Sounds like they need a nice Hackintosh for that
               | validated FDA regulation app-OS-HW combo.
        
               | becurious wrote:
               | Good luck getting that through a regulated company's
               | Quality Management System or their legal department. Way
               | too much business risk and the last thing you want is a
               | yellow or red flag to an inspector who can stop ship on
               | your product until all the recall and remediation is
               | done.
        
             | numbsafari wrote:
             | > Why should Microsoft suddenly invest in solving it on
             | Windows?
             | 
             | If they can continue to avoid commercial repercussions for
             | failing to provide a stable and secure system, then society
             | should begin to hold them to account and force them to.
             | 
             | I'm not necessarily advocating for eBPF here, either. If
             | they want to get there through some "proprietary" means, so
             | be it. Apple is doing much the same on their end by locking
             | down kexts and providing APIs for user mode system
             | extensions instead. If MS wants to do this with some kind
             | of .net-based solution (or some other fever dream out of
             | MSR) then cool. The only caveat would seem to be that they
             | are under a number of "consent decree" type agreements that
             | would require that their own extensions be implemented on a
             | level playing field.
             | 
             | So what. Windows Defender shouldn't be in the kernel any
             | more than CrowdStrike. Add an API. If that means being able
             | to send eBPF type "programs" into kernel space, cool. If
             | that means some user mode APIs, cool.
             | 
             | But lock it down already.
        
             | wongarsu wrote:
             | Microsoft has invested in solving this for at least two
             | decades, probably longer. They are just using a different
             | (arguably worse) approach to this than the Unix world.
             | 
             | In Windows 9x anti-malware would just run arbitrary code in
             | the kernel that hooked whatever it wanted. In Windows XP a
             | lot of these things got proper interfaces (like the file
             | system filter drivers to facilitate scanning files before
             | they are accessed, later replaced by minifilters), and the
             | 64 bit edition of XP introduced PatchGuard [1] to prevent
             | drivers from modifying Microsoft's kernel code.
             | Additionally Microsoft is requiring ever more static and
             | dynamic analysis to allow drivers to be signed (and thus
             | easily deployed).
             | 
             | This is a very leaky security barrier. Instead of a
             | hardware-enforced barrier like the kernel-userspace barrier
             | it's an effort to get software running at the same
             | protection level to behave. PatchGuard is a cat-and-mouse
             | game Microsoft is always losing, and the analysis mostly
             | helps against memory bugs but can't catch everything. But
             | MS has invested a lot of work over the years in attempts to
             | make this path work. So expecting future actions isn't
             | unreasonable.
             | 
             | [1] https://en.wikipedia.org/wiki/Kernel_Patch_Protection
        
               | Analemma_ wrote:
               | This is a weird reading of history. Microsoft has spent
               | tons of effort getting as much code out of the kernel as
               | possible: Windows drivers used to be almost all kernel-
               | mode, now they're nearly all in userspace and you almost
               | never need to write a kernel-mode Windows driver unless
               | you're doing something with deep OS hooks (like CS was,
               | although apparently even that wasn't actually necessary).
               | The safeguards on kernel code are for the tiny sliver of
               | use cases left that need it, it is not Microsoft patching
               | individual holes on the leaky ship.
               | 
               | They haven't yet gone as far as Apple in banning third-
               | party kernel-mode code entirely, but I wouldn't be
               | surprised if it's coming.
        
               | tptacek wrote:
               | A thing I think a lot of people don't include in their
               | premises about Crowdstrike is that they're probably the
               | most significant aftermarket endpoint security product in
               | the world (they are what Norton and McAfee were in 2000),
               | which means they're more than large enough for malware to
               | target their code directly, which creates interesting
               | constraints for where their code can run.
               | 
               | I'm not saying I'd run it (I would not), just that I can
               | see why they have a lot of kernel-resident code.
        
         | nullindividual wrote:
         | Microsoft already has an extensible file system filter
         | capability in place, which is what current AV uses. Does it
         | make sense to add eBPF on top of that and if so, are there any
         | performance downsides, like we see with file system filters?
        
           | mauvehaus wrote:
           | They've done a technology transition once already from legacy
           | file system filter drivers to the minifilter model. If they
           | see enough benefit to another change, it wouldn't be
           | unprecedented.
           | 
           | Mind you, it looks like after 20-ish years Windows still
           | supports loading legacy filter drivers. Given the
           | considerable work that goes into getting even a simple
           | filesystem minifilter driver working reliably, it's safe to
           | assume that we'd be looking at a similarly protracted
           | transition period.
           | 
           | As to the performance, I don't think the raw infrastructure
           | to support minifilters is the major performance hit. The work
           | the drivers themselves end up doing tends to be the bigger
           | hit in my experience.
           | 
           | Some background for the curious:
           | 
           | https://www.osr.com/nt-insider/2019-issue1/the-state-of-
           | wind...
        
       | Scene_Cast2 wrote:
       | How much extra security does this provide on top of HLK?
        
       | xyzzy123 wrote:
       | So many problems though! including commercial monocultures, lack
       | of update consent, blast radius issues, etc etc. There's a
       | commons in our pockets but that is very difficult to regulate
       | for. They will keep putting the gun to your head until you keep
       | choosing the monoculture.
        
         | shahahqq wrote:
         | worrisome indeed that now the world knows how many users are
         | affected by crowdstrike so the bad guys just need to poke
         | deeper there
        
       | kevin_nisbet wrote:
       | I hate to dispute with someone like Brendan Gregg, but I'm hoping
       | vendors in this space take a more holistic approach to
       | investigating the complete failure chain. I personally tend to
       | get cautious when there is a proposal that x will solve the
       | problem that occurred on y date, especially 3 days after the
       | failure. It may be true, but if we don't do the analysis we could
       | leave ourselves open to blindspots. There may also be plenty of
       | alternative approaches that should be considered and
       | appropriately discarded.
       | 
       | I think the part I specifically dispute is that the only negative
       | outcome is wasted CPU cycles. That's likely the case for this
       | class of bug, but there are plenty of failure modes where a bad
       | ruleset could badly brick a system and make it hard to recover.
       | 
       | That's not to say eBPF-based security modules aren't the right
       | choice for many vendors, just that let's understand what risks
       | they do and do not avoid, and what part of the failure chain they
       | particularly address.
        
         | mirashii wrote:
         | Just because you have not been aware of the discussions on this
         | topic that have been happening for years, doesn't mean that
         | they haven't been happening. This isn't some new analysis
         | formed 3 days after an incident, this is the generally accepted
         | consensus among many experts who have been working in the
         | space, introducing these new APIs specifically to improve
         | stability, security, etc. of systems.
        
         | ohmyiv wrote:
         | > I personally tend to get cautious when there is a proposal
         | that x will solve the problem that occurred on y date,
         | especially 3 days after the failure.
         | 
         | Microsoft has been working on eBPF for a few years at least.
         | 
         | https://opensource.microsoft.com/blog/2021/05/10/making-ebpf...
         | 
         | https://lwn.net/Articles/857215/
         | 
         | If you're really concerned, they have discussions and
         | communication channels where you're invited to air your
         | concerns. They're listed on their github:
         | 
         | https://github.com/microsoft/ebpf-for-windows
         | 
         | Who knows, maybe they already have answers to your concerns. If
         | not, they can address them there.
        
       | the8472 wrote:
       | If the filters are loaded at boot and hook into everything then a
       | bug can still lock down the system to a point where it can't be
       | operated or patched anymore (e.g. because you loaded an empty
       | whitelist). So it could end up replacing a boot loop with another
       | form of DoS.
       | 
       | If microsoft includes a hardcoded whitelist that covers some
       | essentials needed for recovery that could make a bug in such a
       | tool easier to fix, but could still cause effective downtimes
       | (system running but unusable) until such a fix is delivered.
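       | 
       | A minimal sketch of that idea in Python, under stated
       | assumptions: RECOVERY_ESSENTIALS, effective_allowlist and
       | is_allowed are hypothetical names, the program names are made
       | up, and none of this is a real Windows or eBPF API. It only shows
       | a hardcoded set being merged into whatever allowlist gets loaded,
       | so an empty update cannot block the tools needed for repair.

```python
# Hypothetical built-in entries; a real system would choose these carefully.
RECOVERY_ESSENTIALS = frozenset({
    "recovery-shell",
    "updater",
    "network-stack",
})


def effective_allowlist(loaded):
    """Merge the loaded (possibly empty or broken) allowlist with the built-ins."""
    return frozenset(loaded) | RECOVERY_ESSENTIALS


def is_allowed(program, loaded):
    """Return True if `program` may run under the loaded allowlist."""
    return program in effective_allowlist(loaded)


if __name__ == "__main__":
    empty_update = []  # the buggy "empty whitelist" case
    # The recovery tool still runs even though the update was empty.
    print(is_allowed("updater", empty_update))
    # Everything else stays blocked until a fixed list ships.
    print(is_allowed("business-app", empty_update))
```

       | The second call is the "system running but unusable" case:
       | recovery stays possible, but ordinary workloads are still down.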
        
       | twen_ty wrote:
       | Can someone tell me what's the advantage of eBPF over a user mode
       | driver? The article makes it look like eBPF is a "have your cake
       | and eat it too" solution, which sounds too good to be true. Can
       | you run graphics drivers in eBPF, for example?
        
         | tptacek wrote:
         | No, you can't run arbitrary general-purpose programs in eBPF,
         | and you cannot run graphics drivers in it. You generally can't
         | run programs in eBPF whose loops can't be proven bounded, and
         | your program can interact with the kernel only through a small
         | set of explicitly enumerated "helpers" (for any given type of
         | eBPF program, you probably have about 20 of these in total).
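         | 
         | A toy Python sketch of those two restrictions, not the real
         | verifier: the instruction format and verify() are invented for
         | illustration, and ALLOWED_HELPERS is an arbitrary subset. It
         | rejects every backward jump, which is stricter than the kernel
         | (that accepts loops it can prove bounded), and rejects calls to
         | anything outside the helper list.

```python
# Illustrative subset only; the real allowed set depends on the program type.
ALLOWED_HELPERS = {"bpf_map_lookup_elem", "bpf_ktime_get_ns"}


def verify(program):
    """Return a list of reasons to reject `program` (empty list means accepted).

    `program` is a list of tuples: ("call", name), ("jmp", target), ("exit",).
    """
    errors = []
    for pc, insn in enumerate(program):
        op = insn[0]
        if op == "call" and insn[1] not in ALLOWED_HELPERS:
            errors.append(f"insn {pc}: helper {insn[1]!r} is not allowed")
        elif op == "jmp":
            target = insn[1]
            if target <= pc:
                # A backward jump could loop forever, so this toy refuses it outright.
                errors.append(f"insn {pc}: backward jump to {target}")
            elif target >= len(program):
                errors.append(f"insn {pc}: jump out of range to {target}")
    if not program or program[-1][0] != "exit":
        errors.append("program does not end with exit")
    return errors


if __name__ == "__main__":
    ok = [("call", "bpf_ktime_get_ns"), ("jmp", 2), ("exit",)]
    looping = [("call", "bpf_ktime_get_ns"), ("jmp", 0), ("exit",)]
    arbitrary = [("call", "kfree"), ("exit",)]
    for name, prog in [("ok", ok), ("looping", looping), ("arbitrary", arbitrary)]:
        print(name, verify(prog) or "accepted")
```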
        
         | chasil wrote:
         | This is the wiki. I haven't kept up, but this isn't a kernel
         | module.
         | 
         | "eBPF is a technology that can run programs in a privileged
         | context such as the operating system kernel. It is the
         | successor to the Berkeley Packet Filter (BPF, with the "e"
         | originally meaning "extended") filtering mechanism in Linux
         | _and is also used in non-networking parts of the Linux kernel
         | as well._ "
         | 
         | https://en.wikipedia.org/wiki/EBPF
        
         | bewo001 wrote:
         | AFAIK, an ebpf function can only access memory it got handed as
         | an argument or as result from a very limited number of kernel
         | functions. Your function will not load if you don't have
         | boundary checks. Fighting the ebpf validator is a bit like
         | fighting Rust's borrow checker; annoying, at times it's too
         | conservative and rejects perfectly correct code, but it will
         | protect you from panics. Loops will only be accepted if the
         | validator can prove they'll end in time; this means it can be a
         | pain to get the validator to accept a loop. Also, ebpf is a
         | processor-independent byte code, so vectorizing code is not
         | possible (unless the byte code interpreter itself does it).
         | 
         | Given all its restrictions, I doubt something complex like a
         | graphics driver would be possible. But then, I know nothing
         | about graphics driver programming.
        
           | umanwizard wrote:
           | > Fighting the ebpf validator is a bit like fighting Rust's
           | borrow checker
           | 
           | I think this undersells how annoying it is. There's a bit of
           | an impedance mismatch. Typically you write code in C and
           | compile it with clang to eBPF bytecode, which is then checked
           | by the kernel's eBPF verifier. But in some cases clang is
           | smart enough to optimize away bounds checks, but the eBPF
           | verifier isn't smart enough to realize the bound checks
           | aren't needed. This requires manual hacking to trick clang
           | into not optimizing things in a way that will confuse the
           | verifier, and sometimes you just can't get the C code to work
           | and need to write things in eBPF bytecode by hand using
           | inline assembly. All of these problems are massively
           | compounded if you need to support several different kernel
           | versions. At least with the Rust borrow checker there is a
           | clearly defined set of rules you can follow.
        
       | WaitWaitWha wrote:
       | eBPF == extended Berkeley Packet Filter
       | 
       | https://en.wikipedia.org/wiki/Berkeley_Packet_Filter
        
         | kayge wrote:
         | Thanks! This was not a familiar acronym to me... and after some
         | digging[0] apparently it's no longer an acronym:
         | 
         | "BPF originally stood for Berkeley Packet Filter, but now that
         | eBPF (extended BPF) can do so much more than packet filtering,
         | the acronym no longer makes sense. eBPF is now considered a
         | standalone term that doesn't stand for anything."
         | 
         | [0] https://ebpf.io/what-is-ebpf/
        
       | CodeWriter23 wrote:
       | > an unprecedented example of the inherent dangers of kernel
       | programming
       | 
       | I take issue with that. Kernel programming was not to blame;
       | looking up addresses from a file and accessing those memory
       | locations without any validation is. The same technique would
       | yield the same result at any Ring.
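       | 
       | For illustration only, a Python sketch of the missing validation
       | step. The file layout (a count plus a table offset), HEADER,
       | ENTRY and read_entry are all hypothetical; the real file format
       | isn't described in this thread. The point is just that every
       | value read from the file is checked against the file's actual
       | size before it is used to index anything.

```python
import struct

HEADER = struct.Struct("<II")  # entry_count, table_offset (hypothetical layout)
ENTRY = struct.Struct("<I")    # one 32-bit value per entry


def read_entry(blob, index):
    """Return entry `index` from `blob`, raising ValueError on malformed input."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    count, table_offset = HEADER.unpack_from(blob, 0)
    if not 0 <= index < count:
        raise ValueError("index out of range")
    table_end = table_offset + count * ENTRY.size
    if table_offset < HEADER.size or table_end > len(blob):
        raise ValueError("entry table does not fit inside the file")
    (value,) = ENTRY.unpack_from(blob, table_offset + index * ENTRY.size)
    return value


if __name__ == "__main__":
    good = HEADER.pack(2, HEADER.size) + ENTRY.pack(7) + ENTRY.pack(9)
    print(read_entry(good, 1))
    garbage = bytes(64)  # one example of garbage input: all zero bytes
    try:
        read_entry(garbage, 0)
    except ValueError as exc:
        print("rejected:", exc)
```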
        
         | lucianbr wrote:
         | Obviously in userspace it would only crash the running program
         | and not the entire operating system? It's a significant
         | difference.
         | 
         | All of the service interruptions would have been just "computer
         | temporarily not protected by crowdstrike agent". Not the same
         | thing at all.
        
           | CodeWriter23 wrote:
           | > It's a significant difference.
           | 
           | When various apps running the world are crashing, unable to
           | execute because malware protection is failing, there is no
           | difference.
        
             | macobrien wrote:
             | _No_ difference oversells it, IMO -- the fact that the
             | entire OS crashed is what made fixing the bug so arduous,
             | since it required in-person intervention. To be sure,
             | running the code in userspace would still cause
             | unacceptable service interruptions, but the fix could be
             | applied remotely.
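             | 
             | A rough Python sketch of that difference, assuming the
             | agent runs as an ordinary process (CRASHING_AGENT and
             | run_agent are made-up names): the child dies, the parent
             | observes the failure and keeps running, which is what
             | leaves room for a remote fix.

```python
import subprocess
import sys

# Stand-in for a buggy userspace agent: it dies abruptly on startup.
CRASHING_AGENT = "import os; os.abort()"


def run_agent(code):
    """Run `code` in a separate Python process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    status = run_agent(CRASHING_AGENT)
    # The parent (standing in for the rest of the system) is still running here.
    print("agent exited abnormally:", status != 0)
```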
        
         | nine_k wrote:
         | At Ring 3 it would crash an app, not the entire OS.
         | 
         | Yes, the kernel is fine and is not to blame. But running
         | basically a rootkit controlled by a third party indeed _is_ to
         | blame.
        
           | CodeWriter23 wrote:
           | > At Ring 3 it would crash an app, not the entire OS.
           | 
           | That's still an outage for those key systems.
        
             | nequo wrote:
             | It is an outage for the monitoring system, not the system
             | that it monitors.
        
         | dwattttt wrote:
         | FWIW their configuration files can't be holding addresses;
         | those have been randomised in the kernel for at least a decade
        
       | nkozyra wrote:
       | I don't do any kernel stuff so I'm out of my element, but doesn't
       | the fact that Crowdstrike & Linux kernel eBPF already caused
       | kernel crashes[1] sort of downplay the rosiness of the state of
       | things?
       | 
       | [1]: https://access.redhat.com/solutions/7068083
        
         | guipsp wrote:
         | This is specifically addressed in the post you are replying to
        
           | nkozyra wrote:
           | Can you elaborate? What I see about Linux is that Crowdstrike
           | was in the process of adopting eBPF which is ostensibly
           | immune to kernel panics, but that issue shows their eBPF
           | implementation specifically causing a kernel panic.
        
       | mschuster91 wrote:
       | > If your company is paying for commercial software that includes
       | kernel drivers or kernel modules, you can make eBPF a
       | requirement. It's possible for Linux today, and Windows soon.
       | While some vendors have already proactively adopted eBPF (thank
       | you), others might need a little encouragement from their paying
       | customers.
       | 
       | How about Microsoft's large government and commercial customers
       | make it a requirement that MS does not develop a single new
       | feature for the next two fucking years or however long it takes
       | to go through the entirety of the Windows+Office+Exchange code
       | base and to make sure there are no security issues in there?
       | 
       | We don't need ads in the start menu, we don't need telemetry, we
       | don't need desktop Outlook becoming a rotten slow and useless web
       | app, we don't need AI, we certainly don't need Recall. We need an
       | OS environment that doesn't need a Patch Tuesday where we have to
       | check if the update doesn't break half the canary machines.
       | 
       | And while MS is _at that_ they can also take the goddamn time and
       | rework the entire configuration stack. I swear to god, it drives
       | me nuts. There's stuff that's only accessible via the registry
       | (and there is no comprehensive documentation showing exactly what
       | _any_ key in the registry can do - large parts of that are MS-
       | internal!), there's stuff only accessible via GPO, there's stuff
       | hidden in CPLs dating back to Windows 3.11, and there's stuff in
       | Windows' newest UI/settings framework.
        
       | throwaway2037 wrote:
       | The blog post says:
       | 
       | > eBPF, which is immune to such crashes.
       | 
       | I tried to Google about this, but I cannot find anything
       | definitive. It looks like you can still break things. Can an
       | expert on eBPF please comment on this claim? This is the best
       | that I could find:
       | https://stackoverflow.com/questions/70403212/why-is-ebpf-sai...
        
         | umanwizard wrote:
         | eBPF programs cannot crash the kernel, assuming there are no
         | bugs in the eBPF verifier. There have been such bugs in the
         | past but they seem to be getting more and more rare.
        
           | javierhonduco wrote:
           | Or in other parts of the kernel. It's been the case on
           | multiple occasions that buggy locking (or, more generally, a
           | missing 'resource' release) has caused problems for perfectly
           | safe BPF programs. For example, see
           | https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1033398 and
           | the fix https://git.kernel.org/pub/scm/linux/kernel/git/torva
           | lds/lin...
        
             | umanwizard wrote:
             | This is actually exactly the bug I was thinking of, so fair
             | point! (I work at PS now and am aware you worked on
             | debugging it a while back).
        
           | rwmj wrote:
           | This isn't really true. eBPF programs in Linux have access to
           | a large set of helper functions written in plain C.
           | https://lwn.net/Articles/856005/
        
             | umanwizard wrote:
             | I don't see how this contradicts what I said. Indeed, there
             | are helpers, but the verifier is supposed to check that the
             | eBPF program isn't calling them with invalid arguments.
        
           | queuebert wrote:
           | I would be very hesitant to say "cannot" in a million-line C
           | code base.
        
             | umanwizard wrote:
             | Yes, bugs in Linux are possible, so there might be some
             | eBPF code that crashes the kernel. Just like bugs in Chrome
             | are possible, so there might be some JavaScript that
             | crashes the browser. Still, JavaScript is much safer than
             | native code, because fixing the bugs in one implementation
             | is a tractable problem, whereas fixing the bugs in all user
             | code is not.
        
       | __MatrixMan__ wrote:
       | Maybe we should start taking Fridays off to commemorate the
       | event, which probably would have been less bad if more people
       | spent less time with their nose to the grindstone and had more
       | time to stop and think about how it all was shaping up and how
       | they could influence that shape.
        
       | ReleaseCandidat wrote:
       | Sorry, but neither eBPF nor Rust nor formal verification nor ...
       | is going to solve that problem. Repeat after me: there are no
       | technical solutions to social problems. As long as the result of
       | such an outage is basically a "oh, a software problem! _shrug_",
       | _nothing_ will change.
        
       | Yawrehto wrote:
       | 1. How does eBPF solve this? It makes it more difficult, sure,
       | but it'll almost always be _possible_ to cause a crash, if you
       | try hard enough.
       | 
       | 2. More importantly, the problem is rarely fixable by changing
       | technology, because typically, problems are caused by people
       | and their connections: social/corporate pressures,
       | profit-seeking, mental health being treated as unimportant, et
       | cetera. eBPF can't fix those, and as long as corporations have
       | social structures that penalize thoroughness and caution, and
       | incentivize getting 'the most stuff' done, this will persist as
       | a problem.
        
         | umanwizard wrote:
         | > it'll almost always be possible to cause a crash, if you try
         | hard enough.
         | 
         | If you think you know a way to crash the Linux kernel by
         | loading and running an eBPF program, you should report a bug.
        
       | uticus wrote:
       | > eBPF programs cannot crash the entire system because they are
       | safety-checked by a software verifier and are effectively run in
       | a sandbox.
       | 
       | Isn't one of the purposes of an OS to police software? I get that
       | this has to do with the OS itself, but what does watching the
       | watchers accomplish other than adding a layer which must then be
       | watched?
       | 
       | Why not reduce complexity instead of naively trusting that the
       | new complexity will be better long term?
        
         | MetaWhirledPeas wrote:
         | Right? I might spend a few minutes seeing if an AI chatbot can
         | explain all the justifications that lead to using something
         | like CrowdStrike in the first place.
        
         | riskable wrote:
         | eBPF isn't "watching the watchers" it's just a tool that lets
         | _other_ tools access low-level things in the kernel via a very
         | picky sandbox. Think of it like this:
         | 
         | Old way: Load kernel driver, hook into bazillions of system
         | calls (doing whatever it is you want to do), pray you don't
         | screw anything up (otherwise you _can_ get a panic though not
         | necessarily--Linux is quite robust).
         | 
         | eBPF way: Just ask eBPF to tell you what you want by giving it
         | some eBPF-specific instructions.
         | 
         | There's a rundown on how it works here:
         | https://ebpf.io/what-is-ebpf/
        
       | risenshinetech wrote:
       | Thank God some superheroes have finally come along to make sure
       | code never crashes any computers ever again! /s
        
       | klooney wrote:
       | First io_uring, now eBPF. Kind of wild.
        
       | tracker1 wrote:
       | I don't buy it... didn't a bug from Red Hat + Crowdstrike have a
       | similar panic issue? I understand in that case it was because of
       | Red Hat, but still. I don't think this, by itself, will change
       | much.
        
       | kaliszad wrote:
       | "These security agents will then be safe and unable to cause a
       | Windows kernel crash."
       | 
       | Unless of course there is a bug in eBPF
       | (https://access.redhat.com/solutions/7068083) @brendangregg and
       | the kernel panics/BSoDs anyway, which you mention later in the
       | article of course.
        
         | ec109685 wrote:
         | The benefit of fixing that bug is that all eBPF programs
         | benefit, versus every security vendor needing to ensure they
         | write perfect C code.
        
       | throw0101d wrote:
       | Meta:
       | 
       | > _eBPF (no longer an acronym)_ [...]
       | 
       | Any reason why the official acronym was done away with?
        
         | riskable wrote:
         | Because it used to stand for extended Berkeley Packet Filter
         | and it has since moved far, far beyond just packets. It now
         | hooks into the _entire_ network stack, security, and does
         | observability/tracing for nearly anything and everything in
         | the kernel ("nearly" because some stuff runs when the kernel
         | boots up--before eBPF is loaded--and never again after that).
        
         | sandywaffles wrote:
         | Because eBPF is no longer _just_ packet filtering? It's now
         | used in loads of hook points unrelated to packets or filtering
         | at all.
        
       | bfrog wrote:
       | I wonder if microkernels ever had this kind of bullshit. Had it
       | been a microkernel, would we all be sitting twiddling our thumbs
       | on Friday? Hot take: No.
        
       | dveeden2 wrote:
       | So eBPF is giving us eBFP (enhanced Blue Friday Protection)?
        
       | muth02446 wrote:
       | > The verifier is rigorous -- the Linux implementation has over
       | 20,000 lines of code -- with contributions from industry (e.g.,
       | Meta, Isovalent, Google) and academia (e.g., Rutgers University,
       | University of Washington). The safety this provides is a key
       | benefit of eBPF, along with heightened security and lower
       | resource usage.
       | 
       | Wow, 20k is not exactly encouraging. Besides the extra attack
       | surface, who can vouch for such a large code base?
        
         | haberman wrote:
         | I had exactly the same thought. I don't know if that 20k number
         | was supposed to inspire confidence, but for me it did the
         | opposite. It would have inspired confidence if it was 300 lines
         | of code.
         | 
         | My impression is that the WebAssembly verifier is much simpler.
        
       | brundolf wrote:
       | This sounds like a cool technology, but this was the really
       | egregious problem:
       | 
       | > There are other ways to reduce risks during software deployment
       | that can be employed as well: canary testing, staged rollouts,
       | and "resilience engineering" in general
       | 
       | You don't need a new technology to implement basic industry-
       | standard quality control
        
       | odyssey7 wrote:
       | "The verifier is rigorous"
       | 
       | But the appeal-to-authority evidence that the article presents is
       | not.
       | 
       | "-- the Linux implementation has over 20,000 lines of code --
       | with contributions from industry (e.g., Meta, Isovalent, Google)
       | and academia (e.g., Rutgers University, University of
       | Washington). The safety this provides is a key benefit of eBPF,
       | along with heightened security and lower resource usage."
        
       | lazycog512 wrote:
       | "The major difference between a thing that might go wrong and a
       | thing that cannot possibly go wrong is that when a thing that
       | cannot possibly go wrong goes wrong it usually turns out to be
       | impossible to get at and repair."
       | 
       | - Douglas Adams
        
       | rezonant wrote:
       | > the company behind this outage was already in the process of
       | adopting eBPF, which is immune to such crashes
       | 
       | Oh I'm sure they'll find a way.
        
       | egorfine wrote:
       | One option to prevent this is to not run corporate spyware. But I
       | guess for some industries this isn't an option.
        
       | 0xbadcafebee wrote:
       | > In the future, computers will not crash due to bad software
       | updates
       | 
       | I'm still waiting on my flying car...
        
       | tgtweak wrote:
       | Even if Microsoft rolls out eBPF and mainstreams it - it will be
       | years before everything is ported over and it still won't address
       | legacy Windows versions (which appear to be a good chunk of what
       | was impacted).
       | 
       | It's a move in the right direction but it probably won't fully
       | mitigate issues like this for another 5+ years.
        
       | ksec wrote:
       | The article mentions Windows and Linux. Does anyone know if there
       | will be eBPF for FreeBSD?
        
       | titzer wrote:
       | WebAssembly is a better choice for sandboxing kernel code. It has
       | a full formal specification with a mechanized proof of type
       | safety, many high-performance implementations, broad toolchain
       | support, and a capability security model, and it is targetable
       | from many languages.
        
       | datadeft wrote:
       | It is great that we need a Linux kernel feature to be ported to
       | Windows so we don't have blue Fridays
        
       | 7e wrote:
       | eBPF will be an improvement, I'm sure, but it does not mean the
       | end of bugs/DoS in software.
        
       | wiresurfer wrote:
       | Hey Brendan,
       | 
       | > If your company is paying for commercial software that includes
       | kernel drivers or kernel modules, you can make eBPF a
       | requirement.
       | 
       | "Windows soon" may still be at least a year away. Would that be a
       | fair statement? "At least" being the operative phrase here.
       | 
       | Specifically in the context of network security software, for
       | eBPF programs to be portable across Windows/Linux, we would need
       | MSFT to add a lot more hooks and expose internal kernel structs,
       | hopefully via a common libbpf definition. Otherwise, I fear,
       | having two versions of the same product across two OSs would
       | mean more security and quality issues.
       | 
       | I guess the point I am trying to make is, we would get there, but
       | we are more than a few years away. I would love to see something
       | like Cilium on vanilla Windows for a software-defined
       | company-wide network. We can then start building enterprise
       | network security into it. Baby steps!
       | 
       | ---
       | 
       | btw, your talks and blog posts about bpftools are a godsend!
        
       | fullspectrumdev wrote:
       | This puts an awful lot of stock in the robustness of eBPF.
       | 
       | Which is odd, given there's been a bunch of kernel privesc bugs
       | using eBPF...
        
       ___________________________________________________________________
       (page generated 2024-07-22 23:07 UTC)