[HN Gopher] Preliminary Post Incident Review
       ___________________________________________________________________
        
       Preliminary Post Incident Review
        
       Author : cavilatrest
       Score  : 104 points
       Date   : 2024-07-24 04:35 UTC (18 hours ago)
        
 (HTM) web link (www.crowdstrike.com)
 (TXT) w3m dump (www.crowdstrike.com)
        
       | Scaevolus wrote:
       | "problematic content"? It was a file of all zero bytes. How
       | exactly was that produced?
        
         | Zironic wrote:
         | If I had to guess blindly based on their writeup, it would seem
         | that if their Content Configuration System is given invalid
         | data, instead of aborting the template, it generates a null
         | template.
         | 
          | To a degree it makes sense, because it's not unusual for a
          | template generator to provide a null response if given
          | invalid inputs; however, the Content Validator then took that
          | null and published it instead of handling the null case as it
          | should have.
        
           | jiggawatts wrote:
           | Returning null instead of throwing an exception when an error
           | occurs is the quality of programming I see from junior
           | outsourced developers.
           | 
           | "if (corrupt digital signature) return null;"
           | 
           | is the type of code I see buried in authentication systems,
           | gleefully converting what should be a sudden stop into a
           | shambling zombie of invalid state and null reference
           | exceptions fifty pages of code later in some controller
           | that's already written to the database on behalf of an
           | attacker.
           | 
           | If I peer into my crystal ball I see a vision of CrowdStrike
           | error handling code quality that looks suspiciously the same.
           | 
           | (If I sound salty, it's because I've been cleaning up their
           | mess since last week.)
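            | 
            | For contrast, a minimal sketch of the two styles in C++,
            | with hypothetical names (none of this is CrowdStrike's
            | actual code):
            | 
            |     #include <optional>
            |     #include <stdexcept>
            |     #include <string>
            | 
            |     struct Config { std::string body; };
            | 
            |     // Anti-pattern: swallow the failure and hand back
            |     // "nothing", hoping every caller remembers to check.
            |     std::optional<Config> load_config_quietly(bool sig_ok) {
            |         if (!sig_ok) return std::nullopt;  // error vanishes
            |         return Config{"..."};
            |     }
            | 
            |     // Fail-fast: a corrupt signature stops processing here,
            |     // not as a null dereference fifty calls later.
            |     Config load_config_or_throw(bool sig_ok) {
            |         if (!sig_ok)
            |             throw std::runtime_error("corrupt signature");
            |         return Config{"..."};
            |     }
            | 
            |     int main() {
            |         try {
            |             load_config_or_throw(false);
            |         } catch (const std::exception&) {
            |             return 1;  // failure visible at its source
            |         }
            |     }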
        
         | chrisjj wrote:
          | They've said the crash was not related to those zero bytes.
         | https://www.crowdstrike.com/blog/falcon-update-for-windows-h...
        
       | romwell wrote:
       | This reads like a bunch of baloney to obscure the real problem.
       | 
       | The only relevant part you need to see:
       | 
       |  _> Due to a bug in the Content Validator, one of the two
       | Template Instances passed validation despite containing
       | problematic content data_.
       | 
       |  _Problematic content_? Yeah, this is telling exactly nothing.
       | 
       | Their mitigation is "ummm we'll test more and maybe not roll the
       | updates to everyone at once", without any direct explanation on
       | how that would prevent this from happening again.
       | 
       | Conspicuously absent:
       | 
       | -- fixing whatever produced "problematic content"
       | 
       | -- fixing whatever made it possible for "problematic content" to
       | cause "ungraceful" crashes
       | 
       | -- rewriting code so that the Validator and Interpreter would use
       | the _same_ code path to catch such issues in test
       | 
       | -- allowing the sysadmins to roll back updates before the OS
       | boots
       | 
       | -- diversifying the test environment to include _actual_ client
       | machine configurations running _actual_ releases _as they would
       | be received by clients_
       | 
       | This is a nothing sandwich, not an incident review.
        
         | Zironic wrote:
         | >Add additional validation checks to the Content Validator for
         | Rapid Response Content. A new check is in process to guard
         | against this type of problematic content from being deployed in
         | the future.
         | 
         | >Enhance existing error handling in the Content Interpreter.
         | 
         | They did write that they intended to fix the bugs in both the
         | validator and the interpreter. Though it's a big mystery to me
         | and most of the comments on the topic how an interpreter that
         | crashes on a null template would ever get into production.
        
           | romwell wrote:
           | _> They did write that they intended to fix the bugs_
           | 
           | I strongly disagree.
           | 
           |  _Add additional validation_ and _enhance error handling_ say
           | as much as  "add band-aids and improve health" in response to
           | a broken arm.
           | 
           | Which is not something you'd want to hear from a kindergarten
           | that sends your kid back to you with shattered bones.
           | 
           | Note that the things I said were missing _are_ indeed missing
           | in the  "mitigation".
           | 
           | In particular, additional checks and "enhanced" error
           | handling don't address:
           | 
            | -- the fact that it's possible for content to be
            | "problematic" for the interpreter, but not the validator;
           | 
           | -- the possibility for "problematic" content to crash the
           | entire system still remaining;
           | 
           | -- nothing being said about _what_ made the content
           | "problematic" (spoiler: a bunch of zeros, but they didn't say
           | it), _how_ that content was produced in the first place, and
           | the possibility of it happening in the future still
           | remaining;
           | 
            | -- the fact that their clients _aren't in control of their
           | own systems_, have no way to roll back a bad update, and can
           | have their entire fleet disabled or compromised by
           | CrowdStrike in an instant;
           | 
           | -- the business practices and incentives that didn't result
           | in all their "mitigation" steps ( _as well as_ steps
           | addressing the above) being _already_ implemented still
            | driving CrowdStrike's relationship with its employees and
           | clients.
           | 
           | The latter is particularly important. This is less a software
           | issue, and more an _organizational_ failure.
           | 
           | Elsewhere on HN and reddit, people were writing that
           | ridiculous SLA's, such as "4 hour response to a
           | vulnerability", make it practically impossible to release
           | well-tested code, and that reliance on a rootkit for security
           | is little more than CYA -- which means that the writing was
           | on the wall, and _this will happen again_.
           | 
           | You can't fix bad business practices with bug fixes and
           | improved testing. And you can't fix what you don't look into.
           | 
           | Hence my qualification of this "review" as a red herring.
        
             | chrisjj wrote:
             | > people were writing that ridiculous SLA's, such as "4
              | hour response to a vulnerability"
             | 
             | I didn't see people explaining why this was ridiculous.
             | 
             | > make it practically impossible to release well-tested
             | code
             | 
             | That falsely presumes the release must be code.
             | 
             | CrowdStrike say of the update that caused the crash: "This
             | Rapid Response Content is stored in a proprietary binary
             | file that contains configuration data. It is not code or a
             | kernel driver."
        
               | romwell wrote:
               | _> I didn't see people explaining why this was
               | ridiculous._
               | 
               | Because of how it affects priorities and incentives.
               | 
               | E.g.: as of 2024, CrowdStrike didn't implement staggered
               | rollout of Rapid Response content. If you spend a second
               | thinking why that's the case, you'll realize that _rapid_
               | and _staggered_ are literally antithetical.
               | 
               |  _> CrowdStrike say of the update that caused the crash:
               | "This Rapid Response Content is stored in a proprietary
               | binary file that contains configuration data. It is not
               | code or a kernel driver."_
               | 
               | Well, they are lying.
               | 
               | The data that you feed into an _interpreter_ is code, no
               | matter what they want to call it.
        
             | GoblinSlayer wrote:
             | It's not your kid, so "improve health" is the industry
             | standard response here.
        
               | romwell wrote:
               | True, but the question is why they can keep getting away
               | with that.
        
           | TheFragenTaken wrote:
           | What validates the Content Validator? A Content Validator
           | Validator?
        
         | citrin_ru wrote:
         | > fixing whatever made it possible for "problematic content" to
         | cause "ungraceful" crashes
         | 
          | Better to not only fix this specific bug but also to
          | continuously use fuzzing to find more places where external
          | data (including updates) can trigger a crash (or worse, RCE).
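          | 
          | As a sketch of what that could look like (parse_channel_file
          | here is a hypothetical stand-in, not CrowdStrike's real
          | parser), a libFuzzer harness is roughly:
          | 
          |     #include <cstddef>
          |     #include <cstdint>
          | 
          |     // Code under test: must never crash, whatever the bytes.
          |     bool parse_channel_file(const uint8_t* data, size_t size);
          | 
          |     // libFuzzer feeds mutated buffers here; with ASan enabled,
          |     // any crash, OOB read or hang becomes a reproducible case.
          |     extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data,
          |                                           size_t size) {
          |         parse_channel_file(data, size);
          |         return 0;
          |     }
          | 
          |     // build: clang++ -fsanitize=fuzzer,address harness.cpp
          |     //        parser.cpp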
        
           | romwell wrote:
           | That is indeed necessary.
           | 
            | But it seems to me that putting the interpreter in a place
            | in the OS where it _can_ cause a system crash, with that
            | being behavior it's allowed to exhibit, is a fundamental
            | design choice that is not at all addressed by fuzzing.
        
             | cratermoon wrote:
             | An interpreter that handles data downloaded from the
             | internet even. That's an exploit waiting to happen.
        
         | acdha wrote:
         | Also "using memory safe languages for critical components" and
         | "detecting failures to load and automatically using the last-
         | known-good configuration"
        
       | romwell wrote:
       | Copying my content from the duplicate thread[1] here:
       | 
       | This reads like a bunch of baloney to obscure the real problem.
       | The only relevant part you need to see:
       | 
       |  _> Due to a bug in the Content Validator, one of the two
       | Template Instances passed validation despite containing
       | problematic content data_.
       | 
       |  _Problematic content_? Yeah, this is telling exactly nothing.
       | 
       | Their mitigation is "ummm we'll test more and maybe not roll the
       | updates to everyone at once", without any direct explanation on
       | how that would prevent this from happening again.
       | 
       | Conspicuously absent:
       | 
       | -- fixing whatever produced "problematic content"
       | 
       | -- fixing whatever made it possible for "problematic content" to
       | cause "ungraceful" crashes
       | 
       | -- rewriting code so that the Validator and Interpreter would use
       | the _same_ code path to catch such issues in test
       | 
       | -- allowing the sysadmins to roll back updates before the OS
       | boots
       | 
       | -- diversifying the test environment to include _actual_ client
       | machine configurations running _actual_ releases _as they would
       | be received by clients_
       | 
       | This is a nothing sandwich, not an incident review.
       | 
       | [1] https://news.ycombinator.com/item?id=41053703
        
         | dang wrote:
         | > Copying my content from the duplicate thread[1] here
         | 
         | Please don't do this! It makes merging threads a pain because
         | then we have to find the duplicate subthreads (i.e. your two
         | comments) and merge the replies as well.
         | 
         | Instead, if you or anyone will let us know at
         | hn@ycombinator.com which threads need merging, we can do that.
         | The solution is deduplication, not further duplication!
        
       | rurban wrote:
       | They bypassed the tests and staged deployment, because their
       | previous update looked good. Ha.
       | 
        | What if they implemented a release process, and followed it? Like
       | everyone else does. Hackers at the workplace, sigh.
        
         | CommanderData wrote:
         | They know better obviously, transcending process and
         | bureaucracy.
        
         | fulafel wrote:
         | Also it must have been a manual testing effort, otherwise there
         | would be no motive to skip it. IOW, missing test automation.
        
         | throwaway7ahgb wrote:
          | Where do you see that? It looks like there was a bug in the
          | template tester. Or do you mean the manual tests?
        
           | kasabali wrote:
           | > Based on the testing performed before the initial
           | deployment of the Template Type (on March 05, 2024), trust in
           | the checks performed in the Content Validator, and previous
           | successful IPC Template Instance deployments, these instances
           | were deployed into production.
        
       | lopkeny12ko wrote:
       | > When received by the sensor and loaded into the Content
       | Interpreter, problematic content in Channel File 291 resulted in
       | an out-of-bounds memory read triggering an exception. This
       | unexpected exception could not be gracefully handled, resulting
       | in a Windows operating system crash (BSOD).
       | 
       | There was a popular X thread [1] that a lot of people took issue
       | with over the past week, but it hit the nail on the head for the
        | root cause. I suspect a lot of HNers who criticized him now owe
       | him an apology.
       | 
       | [1] https://x.com/Perpetualmaniac/status/1814376668095754753
        
         | nemetroid wrote:
         | No, the Twitter poster is still wrong.
        
         | bdjsiqoocwk wrote:
         | It's called Twitter.
        
           | joenot443 wrote:
           | No, the name's been changed.
        
             | bdjsiqoocwk wrote:
             | No it hasn't.
        
               | justusthane wrote:
               | I'm not a fan of Musk or of the-platform-formerly-known-
               | as-Twitter, but I'm not sure how you can insist that the
               | name hasn't been changed.
        
         | mc32 wrote:
         | How can these companies be certified and compliant, etc., and
         | then in practice have horrible SDLC?
         | 
          | What was the impact of diverse teams (offshoring)? Often
          | companies don't have the necessary checks to ensure that the
          | disparateness of teams does not impact quality. Maybe it was
          | zero or maybe it was more.
        
           | hulitu wrote:
           | > How can these companies be certified and compliant, etc.,
           | and then in practice have horrible SDLC?
           | 
            | Checklists?
        
           | hello_moto wrote:
            | You're saying there exists a complex software system without
            | a bug, despite following best practices to the dot and being
            | certified + compliant?
        
             | mc32 wrote:
             | No, but their release process should catch major bugs such
             | as this. After internal QA, you release to small internal
             | dev team, then to select members of other depts willing to
             | dog-food it, then limited external partners then GA? Or
             | something like that so that you have multiple opportunities
             | to catch weird software/hardware interactions before
             | bringing down business critical systems for major and small
             | companies around the planet?
        
               | hello_moto wrote:
               | > After internal QA, you release to small internal dev
               | team, then to select members of other depts willing to
               | dog-food it, then limited external partners then GA
               | 
               | What about AV definition update for 0day swimming in the
               | tubes right now?
        
               | mc32 wrote:
               | Sure, those have happened before, but nothing with an
               | impact like last weekend. That's inexcusable. At least
               | definitions can update themselves out of trouble.
        
               | hello_moto wrote:
                | What are you referring to with "those have
                | happened before"?
               | 
               | Isn't that what happened? Not a software update, not an
               | AV-definition update but more so an AV-definition "data"
               | update. At least that's how I interpret "Rapid Response
               | Content"
        
           | YZF wrote:
           | Standards generally don't mandate specifics and almost
           | certainly nothing specific to SDLC. At least none I've heard
           | of. Things like FIPS and ISO and SOC2 generally prescribe
           | having a certain process, sometimes they can mandate some
           | specifics (e.g. what ciphers for FIPS). Maybe there should be
           | some release process standards that prescribe how this is
           | done but I'm not aware of any. I think part of the problem is
           | the standard bodies don't really know what to prescribe, this
           | sort of has to come from the community. Maybe not unlike the
           | historical development of other engineering professions.
           | Today being compliant with FIPS doesn't really mean you're
           | secure and being SOC2 compliant doesn't really mean customer
           | data is safe etc. It's more some sort of minimal bar in
           | certain areas of practice and process.
        
             | mc32 wrote:
             | Sadly, I agree with your take. All it is is a minimum bar.
              | Many who don't have the above are even worse -- though not
              | necessarily, but as a rule probably yes.
        
         | cangencer wrote:
          | The thread is still wrong, since it was an OOB memory read, not
         | a missing null pointer check as claimed. 0x9c is likely the
         | value that just happened to be in the OOB read.
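          | 
          | A toy illustration of that failure mode (nothing to do with
          | the actual driver code): an index that runs off the end of a
          | pointer table picks up whatever bytes happen to sit there, so
          | a null check passes even though the pointer is garbage.
          | 
          |     #include <cstdio>
          | 
          |     int main() {
          |         const char* table[4] = {"a", "b", "c", "d"};
          |         unsigned long bad_index = 9;   // from untrusted input
          | 
          |         // Out-of-bounds read: p is whatever lay past the end.
          |         const char* p = table[bad_index];
          |         if (p != nullptr) {             // check passes anyway
          |             std::printf("%c\n", p[0]);  // likely crash here
          |         }
          |     }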
        
          | fulafel wrote:
          | "Incoming data triggered an out-of-bounds memory access bug" is
         | hardly a useful conclusion for a root cause investigation (even
         | if you are of the faith of the single root cause).
        
         | maples37 wrote:
         | https://threadreaderapp.com/thread/1814376668095754753.html
        
         | cataflam wrote:
         | Not really, that thread showed only superficial knowledge and
         | analysis, far from hitting the nail on the head, for anyone
          | used to assembly/reverse engineering. It then goes on to make
         | provably wrong assumptions and comments. There is actually a
         | null check (2 even!) just before trying the memory access. The
         | root cause is likely trying to access an address that's coming
         | from some uninitialized or wrongly initialized or non-
         | deterministically initialized array.
         | 
         | What it did well was explaining the basics nicely for a wide
         | audience who knows nothing about a crash dump or invalid memory
         | access, which I guess made the post popular. Good enough for a
         | general public explanation, but doesn't pass the bar for an
         | actual technical one to any useful degree.
         | 
         | I humbly concur with Tavis' take
         | 
         | https://x.com/taviso/status/1814762302337654829
         | 
          | Here are some others for more technically correct details:
          | 
          | - https://x.com/patrickwardle/status/1814343502886477857
          | 
          | - https://x.com/tweetingjose/status/1814785062266937588
        
       | nodesocket wrote:
        | Why do they insist on using what sounds like military
        | pseudo-jargon throughout the document?
        | 
        | E.g. "sensors"? I mean, how about hosts, machines, clients?
        
         | com wrote:
         | It's endemic in the tech security industry - they've been
         | mentally colonised by ex-mil and ex-law enforcement (wannabe
         | mil) folks for a long time.
         | 
         | I try to use social work terms and principles in professional
         | settings, which blows these people's minds.
         | 
         | Advocacy, capacity evaluation, community engagement, cultural
         | competencies, duty of care, ethics, evidence-based
         | intervention, incentives, macro-, mezzo- and micro-practice,
         | minimisation of harm, respect, self concept, self control etc
         | etc
         | 
         | It means that my teams aren't focussed on "nuking the bad guys
         | from orbit" or whatever, but building defence in depth and
         | indeed our own communities of practice (hah!), and using
         | psychological and social lenses as well as tech and adversarial
         | ones to predict, prevent and address disruptive and dangerous
         | actors.
         | 
         | YMMV though.
        
           | phaedrus wrote:
           | Even computer security itself is a metaphor (at least in its
           | inception). I often wonder what if instead of using terms
           | like access, key, illegal operation, firewall, etc. we'd
           | instead chosen metaphors from a different domain, for example
           | plumbing. I'm sure a plumbing metaphor could also be found
            | for every computer security concern. Would we be so quick to
            | romanticize, as well as militarize, a field dealing with
           | "leaks," "blockages," "illegal taps," and "water quality"?
        
             | com wrote:
             | "Fatbergs" expresses some things delivered by some teams
             | very eloquently for me!
        
         | notepad0x90 wrote:
          | Because those things are different? I didn't see a single piece
          | of "military" jargon. There is absolutely nothing unusual about
          | their wording. It's like someone saying "why do these people
         | use such nerdy words" regarding HN content.
        
         | justusthane wrote:
         | The sensor isn't a host, machine, or a client. It's the
         | software component that detects threats. I guess maybe you
         | could call it an agent instead, but I think sensor is pretty
         | accepted terminology in the EDR space - it's not specific to
         | Crowdstrike.
        
       | coremoff wrote:
       | Such a disingenuous review; waffle and distraction to hide the
       | important bits (or rather bit: bug in content validator) behind a
       | wall of text that few people are going to finish.
       | 
       | If this is how they are going to publish what happened, I don't
       | have any hope that they've actually learned anything from this
       | event.
       | 
       | > Throughout this PIR, we have used generalized terminology to
       | describe the Falcon platform for improved readability
       | 
        | Translation: we've filled this PIR with technobabble so that when
       | you don't understand it you won't ask questions for fear of
       | appearing slow.
        
         | notepad0x90 wrote:
         | > "behind a wall of text that few people are going to finish."
         | 
          | Heh? It's not that long, and it's very readable.
        
           | coremoff wrote:
           | I disagree; it's much longer than it needs to be, is filled
           | with pseudo-technoese to hide that there's little of
           | consequence in there, and the tiny bit of real information in
           | there is couched with distractions and unnecessary detail.
           | 
           | As I understand it, they're telling us that the outage was
           | caused by an unspecified bug in the "Content Validator", and
            | that the file was shipped without testing because it worked
            | fine last time.
           | 
           | I think they wrote what they did because they couldn't
           | publish the above directly without being rightly excoriated
           | for it, and at least this way a lot of the people reading it
           | won't understand what they're saying but it sounds very
           | technical.
        
             | notepad0x90 wrote:
              | No, it's one of the most well-written PIRs I've seen. It
              | establishes terms and procedures after communicating that
              | this isn't an RCA, then details the timeline of tests and
              | deployments done and what went wrong. They were not
              | excessively verbose or terse. This is the right way of
              | communicating to the intended audience: technical people,
              | executives and lawmakers alike will be reading this. They
              | communicated their findings clearly
             | without code, screenshots, excessive historical details and
             | other distractions.
        
         | hello_moto wrote:
          | In the current situation, it's better to be complete, no?
         | 
         | This information is not just for _you_.
        
       | CommanderData wrote:
       | "We didn't properly test our update."
       | 
        | Should be the tldr. On threads there's information about
        | CrowdStrike slashing QA team numbers; whether that was a factor
        | should be looked at.
        
         | hulitu wrote:
          | They write perfect software. Why should they test it? /s
        
       | Ukv wrote:
       | A summary, to my understanding:
       | 
       | * Their software reads config files to determine which behavior
       | to monitor/block
       | 
       | * A "problematic" config file made it through automatic
       | validation checks "due to a bug in the Content Validator"
       | 
       | * Further testing of the file was skipped because of "trust in
       | the checks performed in the Content Validator" and successful
       | tests of previous versions
       | 
       | * The config file causes their software to perform an out-of-
       | bounds memory read, which it does not handle gracefully
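        | 
        | A minimal sketch of the kind of bounds checking the last point
        | implies was missing (the record layout here is made up, not
        | CrowdStrike's format):
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        |     #include <cstring>
        |     #include <optional>
        |     #include <vector>
        | 
        |     struct Record { uint32_t type; uint32_t length; };
        | 
        |     // Every access is checked against the buffer size, so a
        |     // truncated or garbage file yields an error, not an OOB
        |     // read.
        |     std::optional<Record> read_record(
        |             const std::vector<uint8_t>& buf, std::size_t off) {
        |         if (buf.size() < sizeof(Record) ||
        |             off > buf.size() - sizeof(Record))
        |             return std::nullopt;
        |         Record r;
        |         std::memcpy(&r, buf.data() + off, sizeof(Record));
        |         if (r.length > buf.size() - off - sizeof(Record))
        |             return std::nullopt;  // payload runs past the end
        |         return r;
        |     }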
        
         | Narretz wrote:
         | * Further testing of the file was skipped because of "trust in
         | the checks performed in the Content Validator" and successful
         | tests of previous versions
         | 
          | That's crazy. How costly can it be to test the file fully in a
         | CI job? I fail to see how this wasn't implemented already.
        
           | modestygrime wrote:
           | Just reeks of incompetence. Do they not have e2e smoketests
           | of this stuff?
        
           | denton-scratch wrote:
           | > How costly can it be to test the file fully in a CI job?
           | 
           | It didn't need a CI job. It just needed one person to
           | actually boot and run a Windows instance with the Crowdstrike
           | software installed: a smoke test.
           | 
            | TFA is mostly an irrelevant discourse on the product
           | architecture, stuffed with proprietary Crowdstrike jargon,
           | with about a couple of paragraphs dedicated to the actual
           | problem; and they don't mention the non-existence of a smoke
           | test.
           | 
           | To me, TFA is _not_ a signal that Crowdstrike has a plan to
           | remediate the problem, yet.
        
             | hrpnk wrote:
             | They mentioned they do dogfooding. Wonder why it did not
             | work for this update.
        
               | xh-dude wrote:
               | They discuss dogfooding "Sensor Content", which isn't
               | "Rapid Response Content".
               | 
               | Overall the way this is written up suggests some cultural
               | problems.
        
               | stefan_ wrote:
               | You just got tricked by this dishonest article. The whole
               | section that mentions dogfooding is only about actual
                | updates to the kernel driver. This was not a kernel
                | driver update, so the entire section is irrelevant.
        
       | red2awn wrote:
       | > How Do We Prevent This From Happening Again?
       | 
       | > Software Resiliency and Testing
       | 
       | > * Improve Rapid Response Content testing by using testing types
       | such as:
       | 
       | > * Local developer testing
       | 
       | So no one actually tested the changes before deploying?!
        
         | Narretz wrote:
          | And why is it "local developer testing" and not CI/CD? This
         | makes them look like absolute amateurs.
        
           | belter wrote:
           | > This makes them look like absolute amateurs.
           | 
            | This also applies to all the Architects and CTOs at these
            | Fortune 500 companies who allowed these self-updating
            | systems into their critical systems.
           | 
           | I would offer a copy of Antifragile to each of these teams:
           | https://en.wikipedia.org/wiki/Antifragile_(book)
           | 
           | "Every captain goes down with every ship"
        
             | acdha wrote:
             | Architects likely do not have a choice. These things are
             | driven by auditors and requirements for things like
             | insurance or PCI and it's expensive to protest those. I
             | know people who've gone full serverless just to lop off the
             | branches of the audit tree about general purpose server
             | operating systems, and now I'm wondering whether anyone is
             | thinking about iOS/ChromeOS for the same reason.
             | 
             | The more successful path here is probably demanding proof
             | of a decent SDLC, use of memory-safe languages, etc. in
             | contract language.
        
               | belter wrote:
               | > Architects likely do not have a choice.
               | 
                | Architects don't have a choice, CTOs are well paid to
                | golf with the CEO and _delegate_ to their teams,
                | Auditors just audit but are not involved with the
                | technical implementations, Developers just develop
                | according to the Spec, and the Security team is just a
                | pain in the ass. Nobody owns it...
                | 
                | Everybody gets well paid, and at the end we have to get
               | lessons learned...It's a s*&^&t show...
        
               | mardifoufs wrote:
               | Some industries are forced by regulation or liability to
               | have something like crowdstrike deployed on their
               | systems. And crowdstrike doesn't have a lot of
               | alternatives that tick as many checkboxes and are as
               | widely recognized.
        
               | belter wrote:
               | Please give me an example of that _specific_ regulation.
        
               | hello_moto wrote:
                | Seems like everyone thinks that Execs play golf with
                | other Execs to seal the deal regardless of how b0rken
                | the system is.
               | 
               | That CTO's job is on the line if the system can't meet
               | the requirement, more so if the system is fucked.
               | 
                | To think that every CTO is a dumbass is like saying
               | "everyone is stupid, except me, of course"
        
               | belter wrote:
                | Not all CTOs... but you just saw hundreds of companies
                | who could do better....
        
               | hello_moto wrote:
                | That is true, hundreds of companies have no backup
                | process in place :D
        
           | RaftPeople wrote:
           | The fact that they even listed "local developer testing" is
           | pretty weird.
           | 
           | That is just part of the basic process and is hardly the
           | thing that ensures a problem like this doesn't happen.
        
           | radicaldreamer wrote:
            | They don't care; CI/CD, like QA, is considered a cost center
           | for some of these companies. The cheapest thing for them is
           | to offload the burden of testing every configuration onto the
           | developer, who is also going to be tasked with shipping as
           | quickly as possible or getting canned.
           | 
           | Claw back executive pay, stock, and bonuses imo and you'll
           | see funded QA and CI teams.
        
           | hyperpape wrote:
           | It sure sounds like the "Content Validator" they mention is a
            | form of CI/CD. The problem is that the file passed that
            | validation but was still capable of failing in reality.
        
         | spacebanana7 wrote:
         | This also becomes a security issue at some point. If these
         | updates can go in untested, what's to stop a rogue employee
         | from deliberately pushing a malicious update?
         | 
         | I know insider threats are very hard to protect against in
         | general but these companies must be the most juicy target for
         | state actors. Imagine what you could do with kernel space code
         | in emergency services, transport infrastructure and banks.
        
       | nine_zeros wrote:
       | Will managers continue to push engineers even when engineers
        | advise going slower, or not?
        
         | bobwaycott wrote:
         | Always.
        
       | Cyphase wrote:
       | Lots of words about improving testing of the Rapid Response
       | Content, very little about "the sensor client should not ever
       | count on the Rapid Response Content being well-formed to avoid
       | crashes".
       | 
       | > Enhance existing error handling in the Content Interpreter.
       | 
       | That's it.
       | 
       | Also, it sounds like they might have separate "validation" code,
       | based on this; why is "deploy it in a realistic test fleet" not
       | part of validation? I notice they haven't yet explained anything
       | about what the Content Validator does to validate the content.
       | 
       | > Add additional validation checks to the Content Validator for
       | Rapid Response Content. A new check is in process to guard
       | against this type of problematic content from being deployed in
       | the future.
       | 
       | Could it say any less? I hope the new check is a test fleet.
       | 
       | But let's go back to, "the sensor client should not ever count on
       | the Rapid Response Content being well-formed to avoid crashes".
        
         | hun3 wrote:
         | Is error handling enough? A perfectly valid rule file could
         | hang (but not outright crash) the system, for example.
        
           | throwanem wrote:
           | If the rules are Turing-complete, then sure. I don't see
            | enough in the report to tell one way or another; the way the
            | rules are described, as filling in templates, about equally
            | suggests either (if templates may reference other templates),
            | and there is not a lot more detail. Halting seems relatively
           | easy to manage with something like a watchdog timer, though,
           | compared to a sound, crash- and memory-safe* parser for a
           | whole programming language, especially if that language
           | exists more or less by accident. (Again, no claim; there's
           | not enough available detail.)
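            | 
            | A userspace-only sketch of the watchdog idea (a real kernel
            | driver would use its own timer machinery; evaluate_rules is
            | just a stand-in that deliberately overruns its budget):
            | 
            |     #include <chrono>
            |     #include <cstdio>
            |     #include <future>
            |     #include <thread>
            | 
            |     bool evaluate_rules() {   // pathological content
            |         std::this_thread::sleep_for(std::chrono::seconds(1));
            |         return true;
            |     }
            | 
            |     int main() {
            |         auto fut = std::async(std::launch::async,
            |                               evaluate_rules);
            |         auto budget = std::chrono::milliseconds(200);
            |         if (fut.wait_for(budget) ==
            |                 std::future_status::timeout) {
            |             // Detecting the overrun is the easy part;
            |             // actually cancelling the work is the hard part.
            |             std::puts("rule evaluation blew its budget");
            |             return 1;
            |         }
            |         return fut.get() ? 0 : 1;
            |     }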
           | 
           | I would not want to do any of this directly on metal, where
           | the only safety is what you make for yourself. But that's the
           | line Crowdstrike are in.
           | 
           | * By EDR standards, at least, where "only" one reboot a week
           | forced entirely by memory lost to an unkillable process
           | counts as exceptionally _good._
        
           | ReaLNero wrote:
            | Perhaps set a timeout on the operation then? Given this is
            | kernel code it's not as easy as in userspace, but I'm sure
            | you could set an interrupt on a timer.
        
         | SoftTalker wrote:
         | > it sounds like they might have separate "validation" code
         | 
         | That's what stood out to me. From the CS post: "Template
         | Instances are created and configured through the use of the
         | Content Configuration System, which includes the Content
         | Validator that performs validation checks on the content before
         | it is published."
         | 
          | Lesson learned: a "Validator" that is not actually the _same
          | program_ that will be parsing/reading the file in production
          | is not a complete test. It's not entirely useless, but it
         | doesn't guarantee anything. The production program could have a
         | latent bug that a completely "valid" (by specification) file
         | might trigger.
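          | 
          | In other words (a sketch with hypothetical names), the
          | publish-time check could simply wrap the very parser that
          | ships, run in a sandbox or VM:
          | 
          |     #include <cstdint>
          |     #include <vector>
          | 
          |     // The exact routine the sensor runs in the field.
          |     bool interpret_content(const std::vector<uint8_t>& file);
          | 
          |     // Validation as a thin wrapper around the same routine:
          |     // the validator and the interpreter can't silently
          |     // disagree about what "valid" means.
          |     bool validate_before_publish(
          |             const std::vector<uint8_t>& file) {
          |         return interpret_content(file);
          |     }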
        
           | modestygrime wrote:
           | I'd argue that it is completely useless. They have the actual
           | parser that runs in production and then a separate "test
           | parser" that doesn't actually reflect reality? Why?
        
       | Cyphase wrote:
       | Direct link to the PIR, instead of the list of posts:
       | https://www.crowdstrike.com/blog/falcon-content-update-preli...
        
         | Cyphase wrote:
         | The article link has been updated to that; it used to be the
         | "hub" page at https://www.crowdstrike.com/falcon-content-
         | update-remediatio...
         | 
         | Some updates from the hub page:
         | 
         | They published an "executive summary" in PDF format:
         | https://www.crowdstrike.com/wp-content/uploads/2024/07/Crowd...
         | 
         | That includes a couple of bullet points under "Third Party
         | Validation" (independent code/process reviews), which they
         | added to the PIR on the hub page, but not on the dedicated PIR
         | page.
         | 
         | > Updated 2024-07-24 2217 UTC
         | 
         | > ### Third Party Validation
         | 
         | > - Conduct multiple independent third-party security code
         | reviews.
         | 
         | > - Conduct independent reviews of end-to-end quality processes
         | from development through deployment.
        
       | squirrel wrote:
       | There's only one sentence that matters:
       | 
       | "Provide customers with greater control over the delivery of
       | Rapid Response Content updates by allowing granular selection of
       | when and where these updates are deployed."
       | 
       | This is where they admit that:
       | 
        | 1. They deployed changes to their software directly to customer
        | production machines;
        | 
        | 2. They didn't allow their clients any opportunity to test those
        | changes before they took effect; and
        | 
        | 3. This was cosmically stupid and they're going to stop doing
        | that.
       | 
       | Software that does 1. and 2. has absolutely no place in critical
       | infrastructure like hospitals and emergency services. I predict
       | we'll see other vendors removing similar bonehead "features" very
       | very quietly over the next few months.
        
         | hello_moto wrote:
         | > I predict we'll see other vendors removing similar bonehead
         | "features" very very quietly over the next few months.
         | 
         | Absolutely this is what will happen.
         | 
          | I don't know much about the practice of AV definition-like
          | features across cybersecurity, but I would imagine it's
          | possible that no vendors do rolling updates today because it
          | involves opt-in/opt-out, which might slow the vendor's
          | response to an attack, which in turn affects their
          | "Reputation" as well.
          | 
          | "I bought Vendor-A's solution but I got hacked and have to pay
          | ransomware" (with a side note: because I did not consume the
          | latest critical update of the AV definitions) is what vendors
          | are worried about.
         | 
         | Now that this Global Outage happened, it will change the
         | landscape a bit.
        
           | XlA5vEKsMISoIln wrote:
           | >Now that this Global Outage happened, it will change the
           | landscape a bit.
           | 
           | I seriously doubt that. Questions like "why should we use
           | CrowdStrike" will be met with "suppose they've learned their
           | lesson".
        
         | mr_mitm wrote:
         | Does anyone test their antivirus updates individually as a
          | customer? I thought they happen multiple times a day; who has
         | time for that?
        
           | packetlost wrote:
           | Yes? Not consumers typically, but many IT departments with
           | certain risk profiles absolutely do.
        
         | packetlost wrote:
         | I really wish we would get some regulation as a result of this.
         | I know people that almost died due to hospitals being down. It
         | should be absolutely mandatory for users, IT departments, etc.
         | to be able to control when and where updates happen on their
         | infrastructure but *especially* so for critical infrastructure.
        
         | SketchySeaBeast wrote:
          | Unfortunately, putting the onus on risk-averse organizations
         | like hospitals and governments to validate the AV changes means
         | they just won't get pushed and will be chronically exposed.
         | 
          | That said, maybe Crowdstrike should consider validating
         | every step of the delivery pipeline before pushing to
         | customers.
        
           | throw0101d wrote:
            |  _Unfortunately, putting the onus on risk-averse
            | organizations like hospitals and governments to validate the
            | AV changes means they just won't get pushed and will be
            | chronically exposed._
           | 
           | I have a similar feeling.
           | 
           | At the very least perhaps have an "A" and a "B" update
           | channel, where "B" is _x_ hours behind A. This way if, in an
            | HA configuration, one side goes down there's time to deal
           | with it while your B-side is still up.
        
           | dmazzoni wrote:
           | Why can't they just do it more like Microsoft security
           | patches, making them mandatory but giving admins control over
           | when they're deployed?
        
             | XlA5vEKsMISoIln wrote:
             | That would be equivalent to asking "would you prefer your
             | fleet to bluescreen now, or later" in this case.
        
               | jaggederest wrote:
               | Presumably you could roll out to 1% and report issues
               | back to the vendor before the update was applied to the
               | last 99%. So a headache but not "stop the world and
               | reboot" levels of hassle.
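                | 
                | One common way to do that gating (a sketch,
                | not necessarily how CrowdStrike would do it):
                | bucket each host by a stable hash and only
                | ship to hosts below the current rollout
                | percentage, e.g. 1 at first, then 99.
                | 
                |     #include <cstddef>
                |     #include <functional>
                |     #include <string>
                | 
                |     bool host_gets_update(const std::string& id,
                |                           std::size_t percent) {
                |         std::size_t bucket =
                |             std::hash<std::string>{}(id) % 100;
                |         return bucket < percent;
                |     }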
        
         | bawolff wrote:
         | > They deployed changes to their software directly to customer
         | production machines; 2. They didn't allow their clients any
         | opportunity to test those changes before they took effect; and
         | 3. This was cosmically stupid and they're going to stop doing
         | that.
         | 
         | Is it really all that surprising? This is basically their
          | business model - it's a fancy virus scanner that is supposed to
         | instantly respond to threats.
        
         | 98codes wrote:
         | Combined with this, presented as a change they could
          | _potentially_ make, it's a killer:
         | 
         | > Implement a staggered deployment strategy for Rapid Response
         | Content in which updates are gradually deployed to larger
         | portions of the sensor base, starting with a canary deployment.
         | 
         | They weren't doing any test deployments at all before blasting
         | the world with an update? Reckless.
        
         | nathanlied wrote:
         | >I predict we'll see other vendors removing similar bonehead
         | "features" very very quietly over the next few months.
         | 
         | If indeed this happens, I'd hail this event as a victory
         | overall; but industry experience tells me that most of those
         | companies will say "it'd never happen with us, we're a lot more
         | careful", and keep doing what they're doing.
        
       | brianmback wrote:
       | The only thing worse than the Crowdstrike incident is the UI they
       | used to publish this report
        
       | duped wrote:
       | Here is my summary with the marketing bullshit ripped out.
       | 
        | Falcon configuration is shipped both with direct driver updates
        | ("sensor content") and out of band ("rapid response content").
       | "Sensor Content" are scripts (*) that ship with the driver.
       | "Rapid response content" are data that can be delivered
       | dynamically.
       | 
       | One way that "Rapid Response Content" is implemented is with
       | templated "Sensor Content" scripts. CrowdStrike can keep the
       | behavior the same but adjust the parameters by shipping "channel"
       | files that fill in the templates.
       | 
       | "Sensor content", including the templates, are a part of the
       | normal test and release process and goes through
       | testing/verification before being signed/shipped. Customers have
       | control over rollouts and testing.
       | 
       | "Rapid Response Content" is deployed through a different channel
       | that customers do not have control over. Crowdstrike shipped a
       | broken channel file that passed validation but was not tested.
       | 
       | They are going to fix this by adding testing of "rapid response"
       | content updates and support the same rollout logic they do for
       | the driver itself.
       | 
       | (*) I'm using the word "script" here loosely. I don't know what
       | these things are, but they sound like scripts.
       | 
       | ---
       | 
       | In other words, they have scripts that would crash given garbage
       | arguments. The validator is supposed to check this before they
       | ship, but the validator screwed it up (why is this a part of
       | release and not done at runtime? (!)). It appears they did not
       | test it, they do not do canary deployments or support rollout of
       | these changes, and everything broke.
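        | 
        | A sketch of the runtime-check idea (the template/channel-file
        | shapes here are invented for illustration, not taken from the
        | PIR):
        | 
        |     #include <cstddef>
        |     #include <optional>
        |     #include <string>
        |     #include <vector>
        | 
        |     struct SensorTemplate { std::size_t expected_params; };
        | 
        |     // Refuse to run a template whose channel-file parameters
        |     // don't match what it expects, instead of trusting the
        |     // release-time validator to have caught the mismatch.
        |     std::optional<std::string> instantiate(
        |             const SensorTemplate& tpl,
        |             const std::vector<std::string>& params) {
        |         if (params.size() != tpl.expected_params)
        |             return std::nullopt;   // reject, don't crash
        |         std::string out;
        |         for (const auto& p : params) out += p + ";";
        |         return out;
        |     }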
       | 
       | Corrupting these channel files sounds like a promising way to
        | attack CS; I wonder if anyone is going down that road.
        
         | hello_moto wrote:
         | > Corrupting these channel files sounds like a promising way to
         | attack CS, I wonder if anyone is going down that road.
         | 
          | Would have happened a long time ago if it was that easy, no?
        
           | duped wrote:
           | How do we know it hasn't?
        
             | hello_moto wrote:
             | If it happened, the industry would have known by now.
             | 
              | The group behind it would come out to the public.
        
               | sudosysgen wrote:
               | This would be the kind of vulnerability that would be
               | worth millions of dollars and used for targeted attacks
               | and/or by state actors. It could take years to uncover
               | (like Pegasus, which took 5 years to be discovered) or
               | never be uncovered at all.
        
       | EricE wrote:
       | A file full of zeros is an "undetected error"? Good grief.
        
         | dmazzoni wrote:
         | It wasn't a file full of zeros that caused the problem.
         | 
         | While some affected users did have a file full of zeros, that
          | was actually a result of the system being in the middle of
          | downloading an update, and not the version of the file that
          | caused the crash.
        
       | jvreeland wrote:
        | I really dislike reading websites that take over half the screen
        | and make me read off to the side like this. I can fix it by
        | zooming in, but I don't understand why they thought making the
        | navigation take up that much of the screen, or not be
        | collapsible, was a good move.
        
       | sgammon wrote:
       | 1) Everything went mostly well
       | 
       | 2) The things that did not fail went so great
       | 
       | 3) Many many machines did not fail
       | 
       | 4) macOS and Linux unaffected
       | 
       | 5) Small lil bug in the content verifier
       | 
       | 6) Please enjoy this $10 gift card
       | 
       | 7) Every windows machine on earth bsod'd but many things worked
        
         | mikequinlan wrote:
         | Regarding the gift card, TechCrunch says
         | 
         | "On Wednesday, some of the people who posted about the gift
         | card said that when they went to redeem the offer, they got an
         | error message saying the voucher had been canceled. When
         | TechCrunch checked the voucher, the Uber Eats page provided an
         | error message that said the gift card "has been canceled by the
         | issuing party and is no longer valid.""
         | 
         | https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-ap...
        
           | sgammon wrote:
           | There's a KB up about this now. To use your voucher, reboot
           | into safe mode and...
        
             | mikequinlan wrote:
             | On another forum a person replied...
             | 
             | >The system to redeem the card is probably stuck in a boot
             | loop
        
         | rm445 wrote:
         | Fun post, but I'll state the obvious because I think many
         | people do believe that every Windows machine BSOD'd. It was
         | only ones with Crowdstrike software. Which is apparently very
         | common but isn't actually pre-installed by Microsoft in
         | Windows, or anything like that.
         | 
         | Source: work in a Windows shop and had a normal day.
        
           | sgammon wrote:
           | True, and definitely worth a mention. This is only
           | Microsoft's fault insofar as it was possible at all to crash
           | this way, this broadly, with so little recourse via remote
           | tooling.
        
       | cataflam wrote:
        | Besides missing actual testing (!) and a staged rollout (!), it
        | looks like they also weren't fuzzing this kernel driver, which
        | routinely takes instant worldwide updates. Oops.
        
         | l00tr wrote:
         | check their developer github, "i write kernel-safe bytecode
         | interpreters" :D, https://github.com/bushidocodes/bushidocodes
        
           | brcmthrowaway wrote:
           | He Codes With Honor(tm)
        
       | gostsamo wrote:
       | Cowards. Why don't you just stand up and admit that you didn't
       | bother testing everything you send to production?
       | 
       | Everything else is smoke and the smell of sulfur.
        
       | aeyes wrote:
       | Do you see how they only talk about technical changes to prevent
       | this from happening again?
       | 
       | To me this was a complete failure on the process and review side.
        | If something so blatantly obvious can slip through, how could I
        | ever trust them to prevent an insider from shipping a backdoor?
       | 
       | They are auto updating code with the highest privileges on
       | millions of machines. I'd expect their processes to be much much
       | more cautious.
        
       | m3kw9 wrote:
       | Still have kernel access
        
       | anonu wrote:
       | In my experience with outages, usually the problem lies in some
       | human error not following the process: Someone didn't do
       | something, checks weren't performed, code reviews were skipped,
       | someone got lazy.
       | 
       | In this post mortem there are a lot of words but not one of them
        | actually explains what the problem was, which is: what was the
       | process in place and why did it fail?
       | 
       | They also say a "bug in the content validation". Like what kind
       | of bug? Could it have been prevented with proper testing or code
       | review?
        
       | 1970-01-01 wrote:
       | >When received by the sensor and loaded into the Content
       | Interpreter, problematic content in Channel File 291 resulted in
       | an out-of-bounds memory read triggering an exception.
       | 
       | Wasn't 'Channel File 291' a garbage file filled with null
       | pointers? Meaning it's problematic content in the same way as
       | filling your parachute bag with ice cream and screws is
       | problematic.
        
         | hyperpape wrote:
         | They specifically denied that null bytes were the issue in an
         | earlier update. https://www.crowdstrike.com/blog/falcon-update-
         | for-windows-h...
        
           | 1970-01-01 wrote:
           | Null pointers, not a null array
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:06 UTC)