[HN Gopher] Microsoft technical breakdown of CrowdStrike incident
       ___________________________________________________________________
        
       Microsoft technical breakdown of CrowdStrike incident
        
       Author : nar001
       Score  : 163 points
       Date   : 2024-07-28 19:55 UTC (3 hours ago)
        
 (HTM) web link (www.microsoft.com)
 (TXT) w3m dump (www.microsoft.com)
        
       | ldjkfkdsjnv wrote:
       | The true story is that I bet some major divisions of Crowdstrike
       | are run by non-technical people that got there through
       | non-meritocratic means. There's generally been no repercussions
       | for their underperformance, much like Boeing. Crowdstrike's
       | business is built on relationships, not technical supremacy. And
       | bada bing bada boom, we have a complete failure of basic
       | technical competency (no rigorous rollout process).
        
         | Paianni wrote:
         | All businesses are built on relationships; technical competency
         | can be, but doesn't have to be, a means to that end.
        
           | Wytwwww wrote:
           | > technical competency
           | 
           | In a more fair world (that also valued economic
           | productivity/growth more) companies which completely ignore
           | that wouldn't survive, though.
        
         | wiseowise wrote:
         | > The true story is that I bet some major divisions of
         | Crowdstrike are run by non-technical people that got there
         | through non-meritocratic means.
         | 
         | Lmao.
         | 
         | > There's generally been no repercussions for their
         | underperformance, much like Boeing. Crowdstrike's business is
         | built on relationships, not technical supremacy. And bada bing
         | bada boom, we have a complete failure of basic technical
         | competency (no rigorous rollout process).
         | 
         | Hope you don't say anything like that in real life.
        
           | ldjkfkdsjnv wrote:
           | I try not to, catch me on the wrong day and it slips out
        
       | jacobgorm wrote:
       | I used to work on Control Flow Integrity (CFI/XFI) research at
       | places like MSR Silicon Valley and VMware, as far back as 2006.
       | Back then, sandboxing a kernel module like ramdisk.sys was doable
       | with a lot of binary rewriting magic, and later with custom LLVM
       | passes, but nowadays it should be a simple matter of compiling
       | the code with clang and the appropriate flags, to completely rule
       | out this type of memory safety error, turning a BSOD into a
       | polite log message and disabling the faulty driver.
        
         | pcwalton wrote:
         | I mean, this is basically what eBPF accomplishes in Linux.
        
           | gclawes wrote:
           | There is eBPF for Windows:
           | https://github.com/microsoft/ebpf-for-windows
           | 
           | I'd hope security products in the future leverage this more
           | than custom kernel-mode sensors.
        
             | capitainenemo wrote:
             | Was discussed on HN last week. Top comment notes the
             | Windows support is still very limited.
             | https://news.ycombinator.com/item?id=41033579
        
         | torginus wrote:
         | From what I understand, CrowdStrike has essentially put a
         | Turing-complete interpreter for their scripting language into
         | the kernel. I doubt you can do much when something is that
         | general purpose.
        
           | capitainenemo wrote:
           | Do you have more information on that? Hadn't read anything
           | about the CS kernel module running arbitrary code. Was it a
           | factor in the crash?
           | 
           | 'course, Microsoft also put Turing-complete scripting in ring
           | 0 years ago for performance reasons (TTFs, plus XML/HTML
           | parsing and GUI rendering, apparently to beat other OSes) and
           | that certainly did lead to exploited vulnerabilities...
           | 
           | https://googleprojectzero.blogspot.com/2016/07/a-year-of-win...
           | https://gist.github.com/Nevor/ed3719dad0cf66893e42a9ba024c91...
           | https://learn.microsoft.com/en-us/security-updates/securityb...
           | https://www.fortinet.com/blog/threat-research/one-bit-to-rul...
           | https://learn.microsoft.com/en-us/security-updates/SecurityA...
           | https://news.ycombinator.com/item?id=9769099 (this comment in
           | particular https://news.ycombinator.com/item?id=9783863)
        
           | jacobgorm wrote:
           | It doesn't matter if you are doing full Fault Isolation with
           | XFI. I recommend reading the paper here
           | https://www.usenix.org/legacy/event/osdi06/tech/full_papers/...
        
           | magicalhippo wrote:
           | Lua has been used in Linux kernel modules[1][2]. At least for
           | the ZFS case I know they were satisfied with the ability to
           | limit what the Lua scripts could do to avoid issues.
           | 
           | [1]: https://lwn.net/Articles/830154/
           | 
           | [2]: https://openzfs.github.io/openzfs-docs/man/master/8/zfs-prog...
        
       | dmattia wrote:
       | I suppose I was expecting something more authoritative here. They
       | confirm that there was an attempted read-out-of-bounds, as
       | CrowdStrike said, but that's not really new information at this
       | point. I suppose we'll need to wait for more detailed analysis
       | from CrowdStrike at some point.
       | 
       | This post explains why security software has historically run in
       | kernel-mode, and really seems to be pushing new technology that
       | Microsoft has that would push security vendors into user-mode
       | (with APIs that attempt to assist with many of the reasons why
       | they have historically used kernel-mode).
       | 
       | Crowdstrike already runs in user-mode on both Mac and Linux (from
       | what I can tell), and it seems like running in user-mode on
       | Windows would significantly lessen the risk of catastrophic
       | failures like a blue-screen-of-death. I know the bulk of the
       | failures here belong to CrowdStrike, but I can't help but think
       | about the fact that Apple kicked security vendors out of kernel-
       | mode a ways back, and that if Windows had done similarly, an
       | issue like this probably wouldn't have been possible. By even
       | offering kernel-mode options to external vendors, I believe
       | Microsoft is creating risk for themselves.
        
         | Rinzler89 wrote:
         | _> I can't help but think about the fact that Apple kicked
         | security vendors out of kernel-mode a ways back, and that if
         | Windows had done similarly, an issue like this probably
         | wouldn't have been possible_
         | 
         | Like others already said, Microsoft already tried to do that
         | with PatchGuard in 2006 with the launch of Windows Vista, and
         | the likes of Symantec and McAfee complained to the EU that this
         | would harm the sales of their products, so the EU told
         | Microsoft not to do it in 2009[1].
         | 
         | Apple has the luxury of a small market share on the desktop PC
         | space to not attract the attention of the regulators, plus a
         | user base that's used to Apple constantly rewriting the OS,
         | deprecating APIs, switching CPU architectures, etc. without
         | giving a fuck about breaking backwards compatibility or cutting
         | off developers' access to OS features their products use and
         | getting away with it, luxuries that Microsoft doesn't have.
         | 
         | IMHO, sticking with Windows' default security and not using
         | third-party anti-malware has made Windows vastly more secure
         | and reliable than it was in the days when you'd be looking at
         | installing the likes of Symantec or McAfee for your
         | "protection", which ended up acting like malware after a while,
         | throwing dark patterns at you to milk more subscription fees.
         | So as much as it hurts their sales, it's important for the
         | regulators to understand that security is far more important
         | than the regulations they put on Windows for Internet Explorer
         | and Media Player. Just like with Apple's App Store, it's
         | sometimes better to let the original product maker handle
         | security and not leave the product open at all points just so
         | some of these bandits can make a living selling security for
         | it. It's like foxes complaining to regulators how chicken wire
         | is a threat to their existence.
         | 
         | [1] https://stratechery.com/2024/crashes-and-competition/
        
           | rrix2 wrote:
           | No, they engaged in malicious compliance (which many here,
           | like yourself, have bought into): rather than rearchitecting
           | their own security software to not rely on trusted
           | kernel-level access, they forced every PC user into a less
           | secure ecosystem where these things must run in the kernel.
        
             | Rinzler89 wrote:
             | That's an interesting theory. Do you have any sources for
             | this? Because so far there have been no technical arguments
             | to support your PoV.
        
               | spott wrote:
               | Wasn't the whole regulatory argument that Microsoft was
               | using kernel mode in their security software, while
               | trying to relegate third party security software to user
               | land? In that case, regulators stepped in and made
               | Microsoft open up kernel mode to level the playing field.
        
             | foota wrote:
             | I don't see the malicious part of the compliance here.
             | Maybe lazy compliance?
        
             | feyman_r wrote:
             | Lots of allegations here. Can you share examples with
             | sources of other operating systems following practices
             | which you mention here? I presume Mac allows the same level
             | of access for CRWD through user mode access only and that's
             | the only way they do it too. Same goes for Linux.
             | 
             | I genuinely want to understand this - how everyone else got
             | it right and this entity got it wrong.
        
         | whimsicalism wrote:
         | The EU requires MS to provide kernel-level access to security
         | vendors due to their crazy anti-compete provisions
        
           | dmattia wrote:
           | This seems to be only partially true when I read into it. The
           | EU said that Microsoft would need to move their security
           | tools into user-space (or at least to use the same APIs as
           | are available in user-space). If they did that (like Apple
           | has done), they could kick everyone out of kernel-space if
           | they wanted.
        
         | TillE wrote:
         | > pushing new technology that Microsoft has that would push
         | security vendors into user-mode
         | 
         | This doesn't exist. It's briefly hinted at in their conclusion,
         | but right now it's simply not there.
         | 
         | There is no userspace equivalent of filesystem minifilters,
         | ObRegisterCallbacks, etc.
        
           | dmattia wrote:
           | This is fascinating, thank you for the info! If I am
           | understanding, it would have then been difficult/impossible
           | for CrowdStrike to create a user-mode only sensor without
           | these equivalent APIs.
           | 
           | So I guess I'm not sure I see validity in the claims of those
           | blaming the EU here. It seems as though the EU would have
           | allowed Microsoft to kick users out of kernel-space if they
           | had APIs that allowed making security products in user-space.
           | Like Linux/Mac already appear to have.
        
             | extraduder_ire wrote:
             | I don't think they would have had to provide those APIs in
             | the EU, so long as their own security products were "kicked
             | out" as well. That's kind of complicated to achieve in a
             | permanent and provable way. Though, windows has had support
             | for eBPF for about two years now.
        
               | TillE wrote:
               | Windows eBPF support is experimental and currently
               | provides hooks for packet filtering stuff and nothing
               | else.
               | 
               | I would be delighted if their long-term solution is eBPF
               | which provides full anti-malware hooks, but again it's
               | unfortunately not there yet.
        
         | __MatrixMan__ wrote:
         | I agree. Microsoft's core competency has traditionally been
         | backwards compatibility, but if each security vendor can tamper
         | with Windows at the deepest level and is allowed to continue
         | exploring all of the ways that they can leverage that... What you
         | end up with is a fleet of different windowses, each diverging
         | further with time. It dilutes the benefits brought by
         | investment into the stability of the system because whatever
         | fights are won in one fragment must be refought in others
         | before you can have confidence in the stability of all
         | fragments.
         | 
         | It seems like madness to me.
        
         | michaelt wrote:
         | _> Crowdstrike already runs in user-mode on both Mac and Linux
         | (from what I can tell),_
         | 
         | Crowdstrike provides a Linux kernel module, and expects users
         | to manually install an extra Secure Boot key for it, as part of
         | their corporate laptop setup procedure.
         | 
         | This has always seemed inadvisable to me, but checkbox checkers
         | gotta check checkboxes I guess.
        
         | GordonS wrote:
         | For one thing, being difficult to kill is a huge selling point
         | for EDR - move it to user space and it's a lot easier to kill.
        
           | pas wrote:
           | A kernel-space watchdog (that checks integrity of the image)
           | would be much easier than a filter that updates from the
           | internet.
           | 
           | Sure, the whole thing is definitely a hard problem, but CS
           | fucking up even the most basic QA **and** error handling ...
           | it just shows how ridiculous their whole claim to having
           | super fancy _technology_ is.
        
       | akira2501 wrote:
       | > where security and availability are non-negotiable.
       | 
       | Yep. You just have to pretend that everyone who deployed Windows
       | had an actual competitive choice available to them.
       | 
       | > A second benefit of loading into kernel mode is tamper
       | resistance.
       | 
       | I guess availability is negotiable after all.
        
         | qsdf38100 wrote:
         | > Yep. You just have to pretend that everyone who deployed
         | Windows had an actual competitive choice available to them.
         | 
         | Could you elaborate? How is that related to security and
         | availability being non negotiable?
        
           | akira2501 wrote:
           | Microsoft's statement implies that people choose Windows
           | because of its security and availability. Whereas most
           | people end up with Windows because the software they want to
           | run only operates on that single platform.
           | 
           | The security and availability, to the extent they even exist,
           | are clearly not part of the market's decision making process.
        
       | janice1999 wrote:
       | At least they're not blaming the European Union in this breakdown
       | (as they did earlier).
        
         | zh3 wrote:
         | Even this is written after multiple reviews by corporate
         | lawyers.
        
         | whimsicalism wrote:
         | they're right though...
        
           | DarkNova6 wrote:
           | Yes. Only Microsoft should be allowed to crash their
           | operating system. Like back in the good old days when only MS
           | could use their secret high-performance APIs.
        
             | graeme wrote:
             | Why exactly _should_ security vendors have the ability to
             | crash the operating system?
        
               | dmattia wrote:
               | They shouldn't. Microsoft should have APIs that enable
               | security vendors to work in userspace.
               | 
               | The EU didn't say that Microsoft couldn't kick vendors
               | out of the kernel, just that they couldn't do so without
               | having the APIs available that would let security vendors
               | operate outside the kernel.
               | 
               | Mac and Linux have such APIs, so CrowdStrike operates in
               | user-mode on those platforms, so those platforms do not
               | give security vendors the ability to crash the operating
               | system.
        
         | strombofulous wrote:
         | Would this still have happened if the EU had not ruled against
         | Microsoft?
        
           | PlutoIsAPlanet wrote:
           | Microsoft can kick security vendors out of the kernel, but they
           | can't sell a product that uses APIs not accessible to other
           | vendors.
        
             | strombofulous wrote:
             | Sure, but my question still stands - would this have
             | happened if the EU had not made that ruling?
        
               | mort96 wrote:
               | Probably
        
               | Tuna-Fish wrote:
               | Yes. There were kernel mode drivers before that ruling,
               | it is essentially entirely irrelevant to this outage.
        
           | holsta wrote:
           | It's not about kernel access, it's about equal access to
           | avoid yet another monopoly.
           | 
           | Microsoft could have come up with a kernel API that their own
           | malware product (and everyone else's) could make use of. They
           | did not.
        
           | extraduder_ire wrote:
           | Probably not, but in more of a butterfly-effect or this
           | product not existing way.
        
         | ziml77 wrote:
         | But the blame wasn't misplaced before. People keep saying that
         | macOS does things better by forcing third parties out of the
         | kernel and instead offering APIs to do the same work in
         | userspace. Microsoft tried to do exactly this for security
         | software in Windows, but the EU didn't like that this change
         | meant that any Microsoft-developed solutions would have an
         | advantage over third party ones.
        
           | ronsor wrote:
           | I really, _really_ wish Microsoft would force third parties
           | out of the kernel.
        
           | Khaine wrote:
           | No, the EU didn't like MS having their malware protection in
           | kernel while kicking out third parties.
           | 
           | If Defender was also kicked out, it would have been fine, but
           | it wasn't.
        
           | tacticus wrote:
           | > Microsoft tried to do exactly this for security software in
           | Windows
           | 
           | Using a monopoly in one industry to capture the market in
           | another industry is what anti monopoly laws are meant to
           | prevent.
           | 
           | Microsoft was prevented because they wanted to retain a
           | commercial business in their security products having special
           | access while locking out everyone else.
        
       | rdtsc wrote:
       | > We plan to work with the anti-malware ecosystem to take
       | advantage of these integrated features to modernize their
       | approach, helping to support and even increase security along
       | with reliability.
       | 
       | > Providing safe rollout guidance, best practices, and
       | technologies to make it safer to perform updates to security
       | products.
       | 
       | > Reducing the need for kernel drivers to access important
       | security data.
       | 
       | They are being as diplomatic as they can, but it's definitely a
       | slap to CS. Read as "they don't know how to roll things out, they
       | need guidance on basic QA practices, we'll happily teach
       | them...". Then, they list a set of facilities running in user-
       | mode to avoid needing to run as many things in kernel mode.
       | 
       | I would be interested to know what the water cooler discussion
       | about CS was like inside Microsoft, especially in teams that
       | needed to respond to customers about "Your windows OS is broken,
       | our hospital patients are suffering...".
        
         | notepad0x90 wrote:
         | I must disagree with that take. Your last quoted sentence is a
         | response to all the self-proclaimed experts asking "why does it
         | need kernel access"; the ones before that are there to limit
         | their own liability.
         | 
         | What I've heard from people in the industry is not this silly
         | "oh no, crowdstrike is so incompetent" b.s. that is being
         | spread on sites like HN and reddit but more of an empathic "it
         | could have been us" sentiment. In this write up as well,
         | Microsoft knows they have caused their share of outages, it is
         | a technical write-up but in part, it is to cover their bases
         | for government investigations and lawsuits that will arise from
         | this incident.
         | 
         | And in part, they are also responsible for recovering from
         | third-party driver errors and repeated boot failures caused by
         | faulty drivers.
        
           | retrochameleon wrote:
           | CrowdStrike blamed their test software, but in the same
           | breath revealed that they haven't been using any canary
           | deployments. The bug that caused all this was present in
           | their kernel driver for a long time.
           | 
           | For being such a large cybersecurity player and deploying
           | updates to 8.5 million devices, their quality control
           | practices are embarrassingly lacking.
        
             | rvnx wrote:
             | Clearly incompetence to deploy from 0 to 8 million devices
             | without any gradual rollout.
             | 
             | That goes even further, because apparently they were fully
             | blind and didn't have crash metrics.
             | 
             | "Ok we push the update, and pray".
        
               | galangalalgol wrote:
               | I think it is past incompetence, and on into negligence.
               | Given the stories we have heard here about emergency
               | service failures it is likely that people died. When
               | people die due to negligence isn't that usually criminal?
        
               | rvnx wrote:
               | Can't agree more, you found the right words.
        
               | binkHN wrote:
               | And this is how the lawsuits will start.
        
               | SoftTalker wrote:
               | Who is negligent though? Crowdstrike, or the emergency
               | services that are using an OS that requires third party
               | endpoint security right out of the box in order to be
               | safely used, or the company that makes and sells that OS?
        
               | crazygringo wrote:
               | Why not both?
               | 
               | Crowdstrike, for negligently not rolling out updates
               | gradually.
               | 
               | And emergency services, if they don't have robust
               | fallback procedures/systems for when their IT system goes
               | down. I mean it's totally fine if regular doctor's visits
               | get postponed, but 911 should never go down just because
               | their computers are down. Just like aircraft have
               | redundant systems, so too should 911.
               | 
               | (The company that makes and sells the OS -- I don't see
               | any negligence there, in this case. If security software
               | fundamentally requires running at the kernel level and
               | Microsoft allows that, I don't see how Microsoft can be
               | at fault.)
        
               | jmb99 wrote:
               | Yeah, I don't see how one can blame Microsoft in this
               | scenario. If you choose to run buggy kernel-level code,
               | that's on you, not the publisher of the kernel/OS.
               | Especially when the code you're running is a replacement
               | for functionality already provided by the OS. It's hard
               | to argue that MS could be negligent for "not having a
               | good enough AV/endpoint protection solution" or "allowing
               | customers to run kernel-level code."
        
             | mort96 wrote:
             | Every company I've ever been at rolls out updates slowly.
             | Rolling out a change to 8.5 million computers at the same
             | time seems ridiculous. Even the most cash-strapped start-ups
             | with every incentive to cut corners tend to get staged
             | roll-outs more or less right. It's crazy.
        
               | binkHN wrote:
               | Beyond crazy. I even have a small app that never makes it
               | to production before being rolled out to internal and
               | open testing first. And, even then, it's slowly rolled
               | out to a percentage at each stage before being fully
               | deployed. One would think a major company with kernel
               | level access would do this at minimum.
        
               | geon wrote:
               | I had a fleet of only maybe 200 computers I updated
               | remotely. I did canary staged roll outs.
        
               | doubled112 wrote:
               | When I managed ~15 developers' Arch Linux workstations, I
               | found it very beneficial to be the canary, then roll out
               | to a couple of the devs more capable of troubleshooting,
               | and then to the rest. I can always fix my own box.
               | 
               | 8.5M all at once feels insane.
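               |
               | The canary-then-stages approach described in this
               | subthread can be sketched roughly as below. This is only
               | an illustration under stated assumptions, not anything
               | CrowdStrike or the commenters are known to run: the
               | function name, the deploy/is_healthy callbacks, and the
               | stage fractions are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> List[str]:
    """Deploy to growing fractions of the fleet, halting at the first bad stage.

    Returns the hosts that received the update. All names are hypothetical.
    """
    updated: List[str] = []
    done = 0
    for fraction in stage_fractions:
        # Always take at least one more host so tiny fleets still progress.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        for host in batch:
            deploy(host)
            updated.append(host)
        done = target
        # Health gate: stop before widening the blast radius any further.
        if not all(is_healthy(h) for h in batch):
            break
        if done == len(hosts):
            break
    return updated
```

               | With a bad update, only the first stage is affected: the
               | health gate fails and the remaining hosts are never
               | touched. A real system would also wait between stages
               | and collect crash metrics before judging health.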
        
             | duskwuff wrote:
             | > CrowdStrike blamed their test software, but in the same
             | breath revealed that they haven't been using any canary
             | deployments.
             | 
             | Their post-incident report [1] also stated that they intend
             | to improve testing by "using testing types such as: local
             | developer testing". One has to wonder what, if any, testing
             | they were doing beforehand.
             | 
             | [1]: https://www.crowdstrike.com/blog/falcon-content-update-preli...
        
           | gjsman-1000 wrote:
           | Microsoft should be sued, for literally having blood on their
           | hands. There was an easily mitigated design flaw in Windows
           | that would have greatly blunted the impact.
           | 
           | https://news.ycombinator.com/item?id=41095788
        
           | freehorse wrote:
           | If "it could have been them", then I would like to read such
           | professionals write exactly about how to avoid having a
           | global outage like this again, rather than "showing empathy"
           | with a corporation. Or do we just leave it up to luck, and if
           | "it happens to them too" in a month or year, oopsies? What
           | about which practices could be improved?
        
           | michaelt wrote:
           | Anyone in the industry could have a bug get through testing.
           | 
           | Some companies could have a severe and readily reproducible
           | bug get through testing.
           | 
           | A few of those companies have a hand-rolled update mechanism,
           | and can accidentally break their ability to roll back a bad
           | release.
           | 
           | A few of _those_ companies are in a position to push a
           | release that breaks not only their own software, but the
           | entire OS.
           | 
           | Very few companies in that position would roll out to 100% of
           | client machines in a single worldwide deployment.
        
         | gnfargbl wrote:
         | It didn't read as _particularly_ diplomatic to me. In
         | particular, this paragraph...
         | 
         |  _> It is possible today for security tools to balance security
         | and reliability. For example, security vendors can use minimal
         | sensors that run in kernel mode for data collection and
         | enforcement limiting exposure to availability issues. The
         | remainder of the key product functionality includes managing
         | updates, parsing content, and other operations can occur
         | isolated within user mode where recoverability is possible._
         | 
         | ...was about as close to tetchy as a post like this would ever
         | get. Basically they are saying "there was no good reason at
         | all why CrowdStrike had to put so much code inside the actual
         | kernel." And with the benefit of hindsight, it's a strong
         | point.
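         |
         | The failure class discussed in this thread (an out-of-bounds
         | read while interpreting content) and the quoted point about
         | parsing content where recovery is possible can be illustrated
         | with a small sketch. This is not CrowdStrike's format or code;
         | the record layout, function names, and field counts are made
         | up for illustration.

```python
from typing import Optional, Sequence


def read_field(record: Sequence[str], index: int) -> Optional[str]:
    """Return record[index], or None when the index is out of range."""
    if 0 <= index < len(record):
        return record[index]
    return None


def evaluate_rule(record: Sequence[str], required_fields: int) -> bool:
    """Reject a record with too few fields instead of reading past its end."""
    if len(record) < required_fields:
        return False
    return all(read_field(record, i) is not None for i in range(required_fields))
```

         | The design choice is simply that malformed input produces a
         | rejected record and a recoverable error path, rather than an
         | unchecked access.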
        
           | ffhhj wrote:
           | > there was no good reason at all why CrowdStrike
           | 
           | Their business is corporate spyware to surveil employees, so
           | of course they'll use any tactic to make it work; that's the
           | why. And their EULA states there is no liability for the
           | company:
           | 
           | https://www.crowdstrike.com/terms-conditions/
           | 
           | Dirty policies on top of dirty practices.
        
             | Rinzler89 wrote:
             | _> Their business is corporate spyware to surveil
             | employees_
             | 
             | What?! Anything you do on your corporate provided laptop is
             | always gonna be logged by IT for security in every large
             | company everywhere, that's news to nobody, but your company
             | doesn't care that you use your corpo laptop to book your
             | vacation, IT has better things to do than narc on you for
             | that.
             | 
             | If your boss wants to actually spy on you they don't need
             | Crowdstrike; there's other SW dedicated to that, depending
             | on the laws in your jurisdiction, but that's not what
             | Crowdstrike is for.
             | 
             | If you want complete privacy from your employer, just use
             | your personal machine for your private activities instead
             | of your work laptop, why is this so hard?
        
               | userbinator wrote:
               | Speak for yourself. There are still companies who don't
               | treat their employees like idiots and actually trust
               | them. Let's not normalise pervasive surveillance.
        
               | Rinzler89 wrote:
               | _> There are still companies who don't treat their
               | employees like idiots and actually trust them._
               | 
               | Yeah sure, but how many of those are large non-tech
               | companies?
               | 
               | You massively overestimate the tech competency of the
               | average PC user if you think it's normal in most
               | companies to not have security monitoring solutions in
               | place or over the internet activity. In our latest
               | phishing test IT did, several users fell for the trap,
               | despite it being a tech company. There's always gonna be
               | someone careless one day and companies want insurance
               | policies against that.
               | 
               | Having such solutions in place doesn't mean the company
               | doesn't trust you, it's more like that old Russian
               | proverb, "trust but verify", and for ticking security
               | compliance boxes as an insurance policy.
               | 
               | Everyone makes mistakes, it's only human. So more like,
               | speak for yourself, if you think your internet activity
               | at work isn't logged anywhere.
        
         | holsta wrote:
         | > they need guidance on basic QA practices
         | 
         | Microsoft has a loooong history of botched (security) updates,
         | so I'm not hopeful they can teach Crowdstrike much.
        
           | SoftTalker wrote:
           | Yes, quite the epitome of throwing stones from a glass house.
        
           | Rinzler89 wrote:
           | Do you happen to have a list of that "loooong history" of
           | botched (security) updates?
           | 
           | I can only find a couple of examples after googling, which is
           | a bit smaller than the "loooong history" you're talking about,
           | so unless Microsoft is paying Google to delete results, maybe
           | you're mistaken.
        
             | SoftTalker wrote:
             | This is a company whose OS could not even be installed on a
             | live network without getting rooted within a few minutes.
             | Anybody who was paying attention knew that you didn't use
             | any new Windows release until at least the first service
             | pack had come out.
             | 
             | Granted that was a while back but painful memories die
             | hard.
        
               | Rinzler89 wrote:
               | _> This is a company whose OS could not even be installed
               | on a live network without getting rooted within a few
               | minutes. _
               | 
               | That was Windows XP 20 years ago. Please bring arguments
               | about modern Windows 11 security, which is the current
               | up-to-date product they're selling and supporting, not
               | scenarios that haven't happened in 20 years.
        
               | clwg wrote:
               | First thing that comes to mind is that Recall stuff from
               | a month ago, they also release updates[0] that crash
               | machines.
               | 
               | [0] https://www.tomsguide.com/news/windows-11-update-
               | causing-blu...
        
               | TeMPOraL wrote:
               | Recall actually is a brilliant idea, and I dreamed of
               | something like it for a long time, and so did plenty
               | people here. It's just not something you can trust a
               | third-party business with, whether it's a fly-by-night
               | startup or an international megacorporation known to be
               | openly promiscuous with advertisers.
               | 
               | This is basically "take a screenshot every 30 seconds and
               | compile it into a timelapse", but on steroids, and the
               | same appeal, and arguments wrt. who gets to run it on
               | whose machines, all apply.
        
               | clwg wrote:
               | The functionality does seem intriguing, that doesn't
               | change its security profile, which was poorly thought out
               | and implemented.
        
               | feyman_r wrote:
               | Ignoring Windows Insider reports is bad. However, how
               | many endpoints having issues (out of a billion+) is
               | 'acceptable' after an update? We live in a news hype
               | cycle so clearly even the one wrong failure will make it
               | up somewhere.
               | 
               | However, without metrics that show BSoDs from patches
               | (which MS will likely never share), it's hard to see if
               | things have improved or regressed. If they regressed,
               | someone up in their leadership chain is hopefully
               | following the constructive discussion here.
        
               | Eduard wrote:
               | for a loooong history, you have to look in the past
        
               | Rinzler89 wrote:
               | Ah, well, if only things of the past were useful today,
               | I'd still have hair, and probably millions made from
               | right investments, but unfortunately, it's what's
               | happening today that actually matters.
        
               | echoangle wrote:
               | So you asked for proof of a long history and are now
               | surprised that the examples are all from the past?
        
               | squigz wrote:
               | GP is absolutely correct. You can't ask for examples of a
               | long history of something, then dismiss examples from,
               | you know, history.
        
               | tacticus wrote:
               | The company that let every db server have global admin
               | creds and 0 logging on their cloud platform?
               | 
               | That didn't run their own enhanced visibility on their
               | own cloud platform.
        
               | lightedman wrote:
               | Vulnerabilities present in 2000 are showing up still in
               | modern Windows versions.
               | 
               | https://www.csoonline.com/article/564499/3-leaked-nsa-
               | exploi...
               | 
               | You have no idea the cruft and technical debt Windows has
               | in order to maintain its backwards compatibility.
        
               | TeMPOraL wrote:
               | That's a bit disingenuous, though. That was, as
               | 'Rinzler89 points out, some 20 years ago. Back then, any
               | Linux distro would've definitely been much safer option,
               | because after installing _you couldn 't even connect it
               | to the network_, because it had no support for your cable
               | modem or wireless card, and that's assuming you didn't
               | fuck up your MBR with LiLo for the 20th time. Ask me how
               | I know.
               | 
               | Both OS families have changed much since that time.
        
               | rvnx wrote:
               | Oh sweet, this laptop has a PCMCIA Wi-Fi card!
               | 
               | That'd be cool if one day I can get the laptop running on
               | battery and not just on mains power.
               | 
               | Let me just set it up.
               | 
               | Wait a second, how do I wake up the screen again and get
               | out of this hibernation stage?
               | 
               | Why are all the fans stuck at 100% now?
               | 
               | Errr, first let's see if I can get the trackpad working.
        
               | feyman_r wrote:
               | Agree. I also remember those days when it was so hard to
               | get Linux to just boot up and get your display working
               | correctly - it was almost like a rite of passage. It was
               | just a proving ground for how much of an expert you were
               | and the number of hours you spent in front of the PC,
               | just to get things working.
               | 
               | My point is, good and bad memories will always stand out.
        
             | system2 wrote:
             | Anyone who worked in IT knows this, it is not something
             | rare. Literally every month, for example one from last
             | month:
             | 
             | https://www.techradar.com/computing/windows/windows-11-upda
             | t...
             | 
             | This is the main reason every IT professional I know
             | disables auto updates of Windows and manually triggers
             | updates after testing (hopefully) on multiple dummy
             | machines on the network.
             | 
             | I personally remember booting to safe mode to remove
             | Windows updates to rescue the computers more times than I
             | can count.
        
               | Rinzler89 wrote:
               | Examples like that one I also found, but that's not
               | really a "looooong list". If people can only show one
               | single example as an argument it's kind of a moot point.
        
               | system2 wrote:
               | You'd experience at least 3-5 per year if you work in IT.
               | There really is a long list but since it is not my
               | argument, I won't list them after searching for an hour.
               | The list starts early 2000s, not recent.
               | 
               | EDIT: Whatever, I will do the search for you since you
               | cannot use google:
               | 
               | https://www.pcgamer.com/an-odd-bug-in-this-months-
               | windows-10...
               | 
               | https://www.windowslatest.com/2023/10/22/windows-11-octob
               | er-...
               | 
               | https://www.bleepingcomputer.com/news/microsoft/windows-1
               | 0-e...
               | 
               | https://www.windowslatest.com/2023/02/09/microsoft-
               | confirms-...
               | 
               | https://www.windowslatest.com/2023/07/16/windows-11-kb502
               | 818...
               | 
               | These are just the last quarter of 2023. There are over
               | 2000 news items but I won't link them. Use keywords:
               | Windows Update, Crash, and use the date option on Google
               | to go before 2023.
        
             | GordonS wrote:
             | There's only been a few _really_ bad ones, but Microsoft
             | botch Windows updates quite regularly.
        
               | Rinzler89 wrote:
               | _> but Microsoft botch Windows updates quite regularly_
               | 
               | OK, please show us the proof then. If it's indeed as
               | regular as you claim then it must be documented
               | somewhere as a greppable list.
               | 
               | Tech blogs would have a field day getting traffic on
               | their site by keeping track of and documenting such
               | regular mistakes if they exist.
        
               | Brybry wrote:
               | It's frequent enough that people pay money for
               | AskWoody[1] to tell them when it's safe to patch or what
               | patches to skip.
               | 
               | [1] https://www.askwoody.com/ms-defcon-system/
        
               | Rinzler89 wrote:
               | Quote, from the website:
               | 
               |  _" In general, I apply Windows Defender updates as soon
               | as they're available. Why? Microsoft hasn't screwed up
               | any of them too badly. You're better off applying those
               | updates than letting them slide for a week or two."_
        
               | Brybry wrote:
               | Yep, Microsoft does a good job with Windows Defender
               | (antivirus) updates.
               | 
               | It's the other Windows Updates that they botch frequently
               | enough to make people wary of patching immediately.
        
               | oxygen_crisis wrote:
               | Here's >100 of them in the past ~8 months:
               | 
               | https://www.manageengine.com/patch-
               | management/resources/micr...
        
               | feyman_r wrote:
               | Where can I find a list for all OSes? I'd assume such a
               | list would have known issues with X11 etc. I want to
               | ensure it's not a case of survivorship bias.
        
             | mrj wrote:
             | Well, from the news this morning:
             | 
             | https://www.forbes.com/sites/daveywinder/2024/07/27/microso
             | f...
        
           | drdec wrote:
           | >> they need guidance on basic QA practices
           | 
           | > Microsoft has a loooong history of botched (security)
           | updates, so I'm not hopeful they can teach Crowdstrike much.
           | 
           | Experience is the best teacher
        
           | cogman10 wrote:
           | And they've learned a lot from it. For example, MS no longer
           | universally deploys updates across the world, they have a
           | slower rollout to avoid just such an incident.
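           | 
           | A minimal sketch of what such a staged rollout could look
           | like, assuming a hypothetical deploy(host) callback that
           | returns False when a host fails. The stage fractions and the
           | failure threshold are made-up illustrative values, not
           | Microsoft's actual process:

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],          # hypothetical: True on success
    stages: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # illustrative fractions
    max_failure_rate: float = 0.02,         # illustrative threshold
) -> Tuple[int, bool]:
    """Deploy to growing fractions of hosts; halt if a stage fails too often.

    Returns (number of hosts attempted, whether the rollout completed).
    """
    done = 0
    for fraction in stages:
        # Each stage covers hosts up to this cumulative fraction of the fleet.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        batch = hosts[done:target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        done = target
        if failures / len(batch) > max_failure_rate:
            return done, False  # stop before the bad update spreads further
    return done, True
```

           | With a 1000-host fleet and an update that breaks every
           | host, this stops after the first 10 machines.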
        
         | f001 wrote:
         | I can tell you they're quite unhappy about it. Have a friend
         | working there who frustratedly says it wasn't their fault
         | every time it comes up. Which is quite often and at every
         | social occasion since.
        
           | fishywang wrote:
           | but it's kind of their fault? they designed the api that way,
           | they decided what can be done in userland and what must be
           | done via kernel. they at least _allowed_ it to happen every
           | time.
        
             | lozenge wrote:
             | You can't just let people do anything from userland, the
             | performance would tank. As for restricting kernelland, EU
             | competition regulators would not be happy if MS was the
             | only one able to write anti virus software that runs in
             | kernelland.
        
               | justinclift wrote:
               | Or perhaps MS could actually try to think of a working
               | solution, rather than blame legislation they don't like?
               | 
               | "Don't blame us! Blame the EU for stopping our monopoly!"
               | 
               | Yeah, good luck with that. ;)
        
       | gjsman-1000 wrote:
       | Reminder that Microsoft _could_ have programmed Windows to notice
       | if a driver has caused a blue screen three times in a row, and
       | prompt if you want to disable the driver on boot. After all,
       | Windows _already_ collects how many times a driver causes a
       | crash. This would have made recovery one click instead of heading
       | into Safe Mode and needing BitLocker keys.
       | 
       | But they didn't.
       | 
       | And Microsoft, I argue, _also_ has blood on their hands for every
       | hospital this hit. Giving users a prompt to disable the driver,
       | after three successive failed boots, would have saved lives.
        
         | t-writescode wrote:
         | How would that have helped the server farms that were
         | experiencing the issue?
        
           | gjsman-1000 wrote:
           | Oh I don't know, the server's down, you go and look as a
           | technician, and you simply see a screen saying:
           | 
           | "CSAgent.sys has caused a failure to boot three times in a
           | row. Do you want to disable this driver? <Yes> <No>."
           | 
           | You click "Yes." Server reboots with CrowdStrike driver
           | disabled. The day is saved in 5 minutes instead of building a
           | custom ISO image or going on a BitLocker key recovery spree.
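           | 
           | A rough sketch of the bookkeeping, just to show the idea.
           | BootCrashTracker and the threshold are hypothetical names
           | for illustration, not an existing Windows mechanism:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

CRASH_THRESHOLD = 3  # "three times in a row", as proposed in the comment


@dataclass
class BootCrashTracker:
    """Hypothetical tracker of consecutive boot failures per driver."""

    consecutive: Dict[str, int] = field(default_factory=dict)

    def record_boot(self, faulting_driver: Optional[str]) -> Optional[str]:
        """Record one boot attempt; pass None for a clean boot.

        Returns the driver to offer disabling, or None if no prompt is due.
        """
        if faulting_driver is None:
            self.consecutive.clear()  # a clean boot ends every streak
            return None
        count = self.consecutive.get(faulting_driver, 0) + 1
        # Only the current culprit keeps a streak; other drivers reset.
        self.consecutive = {faulting_driver: count}
        return faulting_driver if count >= CRASH_THRESHOLD else None
```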
        
             | politelemon wrote:
             | It would still have required on site presence and
             | interaction during which there is still downtime, so this
             | accomplishes marginally small gains.
        
               | gjsman-1000 wrote:
               | At the same time though, imagine you woke up and
               | CrowdStrike hit your organization.
               | 
               | For most users, they'll try clicking "Yes." And then it's
               | back to work. After all, "No" just causes a blue screen
               | again, might as well try the other path.
               | 
               | This would have been the difference between the IT
               | department handling 10,000+ calls or a few hundred (plus
               | sending out a bulletin) in many, many organizations. It
               | also could have saved billions at this point.
               | 
               | Heck, it would have saved _lives_ in hospitals.
        
               | jonathantf2 wrote:
               | But then you have millions of endpoints booting without
               | malware protection
        
               | echoangle wrote:
               | Can you cite some reports of deaths caused by the outage?
        
           | morkalork wrote:
           | Instead of prompting on the screen, disable the driver and
           | boot directly into a recovery state that has networking
           | enabled so sysadmins can push scripts and fixes? As long as
           | it's not a network driver you'd be okay.
        
             | t-writescode wrote:
             | Disable the driver that is explicitly there to protect from
             | malware and attacks?
             | 
             | Wouldn't malware just use that as an attack vector?
        
         | danlitt wrote:
         | Nooo you don't understaaaand kernel code is special :'(
         | actually BSOD was a desired feature because CrowdStrike is a
         | Security (TM) application.
         | 
         | (sorry, just simulating the replies I get when I post this
         | sentiment anywhere else)
        
           | gjsman-1000 wrote:
           | That's very easily mitigated - write the security software so
           | it can't crash. Like, you know, drivers should be written.
           | 
           | Malware can't crash a well-written or memory-safe driver, so
           | it will never be unloaded. Problem solved.
        
             | echoangle wrote:
             | Writing the driver so it can't crash is the hard part, I
             | think the developers knew that this was the goal.
        
         | Uvix wrote:
         | Those hospitals chose to deploy software that didn't support
         | testing. The blood is on their own hands.
        
         | galangalalgol wrote:
         | I think suing MS for the behavior that ensued when people
         | installed a rootkit directly into the kernel and opened all the
         | ports on their network to let that rootkit get used, is...
         | excessive. Both MS and CS should have had a fail to previous
         | good kernel ability, but the negligence here is clearly with CS
         | for not even trying a blank data file in the automated tests
         | for a piece of safety critical software, and then not using
         | canary deployments before pushing to millions of devices.
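         | 
         | To illustrate the "blank data file" test: a sketch of the
         | kind of sanity check that would reject such a file before
         | any parsing happens. The file layout here (a 4-byte magic, a
         | record count, 16-byte records) is entirely made up and is
         | not CrowdStrike's real format:

```python
import struct

MAGIC = b"CFG1"       # hypothetical header, for illustration only
RECORD_SIZE = 16      # hypothetical fixed record size
HEADER_SIZE = 8       # magic (4 bytes) + little-endian record count (4 bytes)


class InvalidContentFile(ValueError):
    """Raised when a content file fails basic structural checks."""


def validate_content(blob: bytes) -> int:
    """Return the declared record count, or raise InvalidContentFile."""
    if len(blob) < HEADER_SIZE:
        raise InvalidContentFile("file too short for header")
    if not any(blob):
        raise InvalidContentFile("file is all zero bytes")
    if blob[:4] != MAGIC:
        raise InvalidContentFile("bad magic")
    (count,) = struct.unpack_from("<I", blob, 4)
    if len(blob) != HEADER_SIZE + count * RECORD_SIZE:
        raise InvalidContentFile("length does not match record count")
    return count
```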
        
         | crazygringo wrote:
         | Do I like your idea for that?
         | 
         | Yes, absolutely. It's a clever idea.
         | 
         | But do I think Microsoft was _negligent_ in not building that?
         | 
         | No, I think that's going too far. Windows already has Safe Mode
         | -- as you note -- to allow for manual recovery, which is what
         | people are using.
         | 
         | I don't think it makes sense for it to be Microsoft's legal
         | responsibility to protect its users from software with a
         | critical bug that wasn't written by Microsoft. Otherwise, where
         | would it end? If a third-party program tries to delete all your
         | user data, is it Microsoft's legal responsibility to check
         | whenever a process is deleting a lot of data, and intervene
         | with a confirmation dialog? Is it Microsoft's responsibility to
         | protect you from all malware and ransomware, no matter how
         | cleverly written? Is it Microsoft's responsibility to
         | constantly cache program state on disk so that when a third-
         | party program crashes, you don't lose your data since your last
         | save?
         | 
         | I think that's going too far, in terms of legal obligation.
        
           | grumpyprole wrote:
           | Microsoft may be negligent in selling a product unsuitable
           | for these applications. Windows is unsuitable precisely
           | because it can be brought down by third party updates, such
           | that it cannot recover without manual intervention by
           | technical experts. Third party vendors are forced into
           | writing unsafe kernel drivers because Microsoft does not
           | provide sufficient user mode APIs.
           | 
           | Windows has a dated design and a security model no longer fit
           | for purpose. As for your other example, it _could_ be
           | protecting users from malicious programs that may delete
           | data, simply by having a better security model, like Android
           | and iOS.
        
             | crazygringo wrote:
             | I don't think Microsoft can be negligent here, because
             | Windows isn't being brought down by _Microsoft_ updates.
             | 
             | Somebody bought Windows, and bought CrowdStrike.
             | CrowdStrike is negligent, and possibly also the person/org
             | who chose to rely on Windows+CrowdStrike without a backup
             | plan if that resulted in further damages to others.
             | 
             | Third party vendors are absolutely not "forced into writing
             | unsafe kernel drivers". They can properly test things to
             | write safer code (which CrowdStrike infamously didn't). And
             | kernel mode is fundamentally required for security software
             | like this, as far as I understand.
             | 
             | And using app-based mobile OS's is not necessarily a useful
             | comparison point. They are limited in all sorts of ways
             | that desktop OS's are not -- and don't you hear people here
             | on HN constantly _complaining_ about that? A better
             | comparison point is macOS and Linux. CrowdStrike also
             | crashed Linux, and macOS still lets you bypass SIP if you
             | want to.
        
         | Khaine wrote:
         | AFAIK Windows does do that, except for drivers that are marked
         | as required for boot. CrowdStrike's drivers are marked as
         | required for boot.
        
         | ziml77 wrote:
         | Imagine I've installed CrowdStrike under the assumption that it
         | makes my system more secure. Why would I want the OS to allow
         | the system to boot up in a less secure state by providing a
         | prompt for that? Most users will just click whichever option
         | gets them back up and running and IT will have no control over
         | that.
        
         | nerdjon wrote:
         | This is very much an "easier said than done" situation that I
         | would think Hacker News of all places would be better about
         | when it comes to "just" doing something in code.
         | 
         | First, Windows already does something similar. After 3 failed
         | boots it is supposed to boot into WindowsRE, which gives you
         | options to revert to a previous version, uninstall updates, and
         | I believe also reverts configurations like recent driver
         | installations.
         | 
         | The problem here though, CrowdStrike itself didn't update. It
         | updated a definition file (last I saw at least) and that likely
         | would not have been caught by Windows as a new version.
         | 
         | Also frankly, not super thrilled at the idea of Windows just
         | deciding to disable/uninstall something except for rolling back
         | (so a previously working config) due to how things could
         | interact. This situation could have been far worse and harder
         | to recover from.
         | 
         | In this case maybe Windows could have noticed that the
         | configuration update is what was causing it and rolled that
         | back, but it's possible it would have just re-downloaded the
         | file when it started back up anyways.
         | 
         | Regarding saved lives, do we actually know that anyone's lives
         | were lost due to this? My local hospitals were still performing
         | emergency surgery.
        
       | zh3 wrote:
       | I do have to wonder how many agonising layers of review this went
       | through with the marketing and legal departments as part of
       | shifting the blame.
       | 
       | If you want to decide which OS/distros to avoid for critical
       | stuff, look to see who's learning from the incident (even if not
       | bitten by it) compared to those saying "it wasn't our fault" (and
       | that's not just MS).
        
       | EasyMark wrote:
       | Oh I like this breakdown a lot. Fairly technical, links to
       | resources used, flow of debug process, didn't get lost in the
       | weeds of details and how clever they were. I wish more debug
       | retrospectives were like this. It seems like you end up with 100
       | pages of analysis or a couple of vague paragraphs.
        
       | userbinator wrote:
       | I'm going to be the controversial one here and say that, as bad
       | as CrowdStrike was, the alternative of having only Microsoft be
       | able to decide what people can do is far worse. I've already seen
       | many others trying to use this incident to advocate for digital
       | totalitarianism.
        
       | superposeur wrote:
       | I'm surprised no one has yet noted that Microsoft itself is a
       | chief CrowdStrike competitor.
        
         | tonymet wrote:
         | i thought crowdstrike provided features that go beyond windows
         | defender. is there another MS product that competes?
        
           | superposeur wrote:
           | FWIW, here is CrowdStrike's own comparison of features:
           | 
           | https://www.crowdstrike.com/compare/crowdstrike-vs-
           | microsoft...
        
       | tonymet wrote:
       | Did either release from MS or Crowdstrike explain how this crash
       | bypassed QC? I'm still baffled that a 100% repro crash even made
       | it anywhere near the later stages of QC. This is something easily
       | caught by the earliest CI phases, at the developer and at least
       | first build automation phase, let alone human QC.
        
         | magicalhippo wrote:
         | From what I read in the previous thread, their test environment
         | didn't actually test what was deployed.
         | 
         | That is, there was a post-test pre-distribution packaging
         | stage, and that's where the distributed file(s) got f'ed up.
         | 
         | If true that would explain how it got past their testing, but
         | would also be an incredible lack of competence IMHO.
         | 
         | But yeah, curious if there's been some more concrete details
         | there.
        
           | tonymet wrote:
           | I heard something similar: that they deploy content
           | separately from code, but they don't test all of the
           | combinations of code + content. This crash was from "stable"
           | code in the driver mixed with a corrupt or incomplete content
           | file (config, etc.), triggering the null-ptr exception.
           | 
           | Sounds like one of those companies where you get hired and
           | are shocked by the sausage factory you just stepped into
        
             | rvnx wrote:
             | In February they added new code that allows spying on or
             | blocking named pipes.
             | 
             | Named pipes are pipes of communication that processes can
             | use to talk to each other, as an alternative to sockets.
             | 
             | For example Chrome uses them between the user interface and
             | the actual page renderer.
             | 
             | In March they tested it in staging, said it was fine,
             | pushed to prod with few rules in April, still looked fine.
             | 
             | In July they added a new rule, which was deployed to 100%
             | immediately, as from their perspective, a new entry in a
             | database definition doesn't need testing or a canary deploy
             | 
             | (which is still irresponsible, because bad rules could
             | cause damage as well like any security/antivirus software,
             | even if the parser didn't crash, but it could have blocked
             | legitimate actions or files)
        
         | pas wrote:
         | lack of fuzzing for their "parser + updater"
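         | 
         | For anyone unfamiliar, a bare-bones sketch of the idea:
         | throw random bytes at the parser and flag any input that
         | escapes with something other than a clean rejection.
         | parse_rules is a toy stand-in I made up, not their parser,
         | and real fuzzers (coverage-guided ones) are far smarter:

```python
import random
from typing import Callable, List, Optional


class ParseError(ValueError):
    """The only exception a well-behaved parser may raise on bad input."""


def parse_rules(blob: bytes) -> List[bytes]:
    """Toy length-prefixed parser: each record is 1 length byte + payload."""
    rules, i = [], 0
    while i < len(blob):
        n = blob[i]
        if i + 1 + n > len(blob):
            raise ParseError("record runs past end of input")
        rules.append(blob[i + 1 : i + 1 + n])
        i += 1 + n
    return rules


def fuzz(
    parser: Callable[[bytes], object], runs: int = 2000, seed: int = 0
) -> Optional[bytes]:
    """Feed random inputs; return the first one that triggers an
    unexpected exception, or None if every input was handled cleanly."""
    rng = random.Random(seed)
    for _ in range(runs):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parser(blob)
        except ParseError:
            continue        # clean rejection is fine
        except Exception:
            return blob     # anything else is a bug worth reporting
    return None
```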
        
       | DeathMetal3000 wrote:
       | "Windows has announced a commitment around the Rust programming
       | language as part of Microsoft's Secure Future Initiative (SFI)
       | and has recently expanded the Windows kernel to support Rust."
        
       ___________________________________________________________________
       (page generated 2024-07-28 23:02 UTC)