[HN Gopher] Preliminary Post Incident Review
___________________________________________________________________
Preliminary Post Incident Review
Author : cavilatrest
Score : 104 points
Date : 2024-07-24 04:35 UTC (18 hours ago)
(HTM) web link (www.crowdstrike.com)
(TXT) w3m dump (www.crowdstrike.com)
| Scaevolus wrote:
| "problematic content"? It was a file of all zero bytes. How
| exactly was that produced?
| Zironic wrote:
| If I had to guess blindly based on their writeup, it would seem
| that if their Content Configuration System is given invalid
| data, instead of aborting the template, it generates a null
| template.
|
| To a degree it makes sense, because it's not unusual for a
| template generator to provide a null response when given
| invalid inputs. However, the Content Validator then took that
| null and published it instead of handling the null case as it
| should have.
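|
| A minimal sketch of the failure mode being hypothesized here
| (every name and check below is invented, not taken from the
| writeup): the generator hands back "no template" instead of
| failing the build, and the validator treats that null case as
| a pass.
|
|     #include <cstdint>
|     #include <optional>
|     #include <vector>
|
|     using Template = std::vector<uint8_t>;
|
|     // Generator that signals bad input by returning "no
|     // template" rather than failing the build outright.
|     std::optional<Template> build_template(
|         const std::vector<uint8_t>& in) {
|         if (in.empty())
|             return std::nullopt;        // the "null template"
|         return Template(in);
|     }
|
|     // Validator that forgets to treat the null case as a
|     // failure before publishing.
|     bool ok_to_publish(const std::optional<Template>& t) {
|         return !t || t->size() <= 4096; // vacuously passes null
|     }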
| jiggawatts wrote:
| Returning null instead of throwing an exception when an error
| occurs is the quality of programming I see from junior
| outsourced developers.
|
| "if (corrupt digital signature) return null;"
|
| is the type of code I see buried in authentication systems,
| gleefully converting what should be a sudden stop into a
| shambling zombie of invalid state and null reference
| exceptions fifty pages of code later in some controller
| that's already written to the database on behalf of an
| attacker.
|
| If I peer into my crystal ball I see a vision of CrowdStrike
| error handling code quality that looks suspiciously the same.
|
| (If I sound salty, it's because I've been cleaning up their
| mess since last week.)
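|
| For contrast, a fail-fast sketch of the same check (purely
| illustrative, not CrowdStrike's code): a corrupt signature
| stops processing immediately instead of leaking a null onward.
|
|     #include <stdexcept>
|     #include <string>
|
|     struct Payload { std::string body; bool signature_ok; };
|
|     // Throw at the point of failure; callers never receive a
|     // "maybe valid" object to trip over fifty pages later.
|     Payload require_verified(const Payload& p) {
|         if (!p.signature_ok)
|             throw std::invalid_argument(
|                 "corrupt digital signature");
|         return p;
|     }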
| chrisjj wrote:
| They've said the crash was not related to those zero bytes.
| https://www.crowdstrike.com/blog/falcon-update-for-windows-h...
| romwell wrote:
| This reads like a bunch of baloney to obscure the real problem.
|
| The only relevant part you need to see:
|
| _> Due to a bug in the Content Validator, one of the two
| Template Instances passed validation despite containing
| problematic content data_.
|
| _Problematic content_? Yeah, this is telling exactly nothing.
|
| Their mitigation is "ummm we'll test more and maybe not roll the
| updates to everyone at once", without any direct explanation on
| how that would prevent this from happening again.
|
| Conspicuously absent:
|
| -- fixing whatever produced "problematic content"
|
| -- fixing whatever made it possible for "problematic content" to
| cause "ungraceful" crashes
|
| -- rewriting code so that the Validator and Interpreter would use
| the _same_ code path to catch such issues in test
|
| -- allowing the sysadmins to roll back updates before the OS
| boots
|
| -- diversifying the test environment to include _actual_ client
| machine configurations running _actual_ releases _as they would
| be received by clients_
|
| This is a nothing sandwich, not an incident review.
| Zironic wrote:
| >Add additional validation checks to the Content Validator for
| Rapid Response Content. A new check is in process to guard
| against this type of problematic content from being deployed in
| the future.
|
| >Enhance existing error handling in the Content Interpreter.
|
| They did write that they intended to fix the bugs in both the
| validator and the interpreter. Though it's a big mystery to me
| and most of the commenters on the topic how an interpreter that
| crashes on a null template would ever get into production.
| romwell wrote:
| _> They did write that they intended to fix the bugs_
|
| I strongly disagree.
|
| _Add additional validation_ and _enhance error handling_ say
| as much as "add band-aids and improve health" in response to
| a broken arm.
|
| Which is not something you'd want to hear from a kindergarten
| that sends your kid back to you with shattered bones.
|
| Note that the things I said were missing _are_ indeed missing
| in the "mitigation".
|
| In particular, additional checks and "enhanced" error
| handling don't address:
|
| -- the fact that it's possible for content to be
| "problematic" for interpreter, but not the validator;
|
| -- the possibility for "problematic" content to crash the
| entire system still remaining;
|
| -- nothing being said about _what_ made the content
| "problematic" (spoiler: a bunch of zeros, but they didn't say
| it), _how_ that content was produced in the first place, and
| the possibility of it happening in the future still
| remaining;
|
| -- the fact that their clients _aren't in control of their own
| systems_, have no way to roll back a bad update, and can have
| their entire fleet disabled or compromised by CrowdStrike in an
| instant;
|
| -- the business practices and incentives that didn't result
| in all their "mitigation" steps ( _as well as_ steps
| addressing the above) being _already_ implemented still
| driving CrowdStrike's relationship with its employees and
| clients.
|
| The latter is particularly important. This is less a software
| issue, and more an _organizational_ failure.
|
| Elsewhere on HN and reddit, people were writing that
| ridiculous SLA's, such as "4 hour response to a
| vulnerability", make it practically impossible to release
| well-tested code, and that reliance on a rootkit for security
| is little more than CYA -- which means that the writing was
| on the wall, and _this will happen again_.
|
| You can't fix bad business practices with bug fixes and
| improved testing. And you can't fix what you don't look into.
|
| Hence my qualification of this "review" as a red herring.
| chrisjj wrote:
| > people were writing that ridiculous SLA's, such as "4
| hour response to a vulnerability"
|
| I didn't see people explaining why this was ridiculous.
|
| > make it practically impossible to release well-tested
| code
|
| That falsely presumes the release must be code.
|
| CrowdStrike say of the update that caused the crash: "This
| Rapid Response Content is stored in a proprietary binary
| file that contains configuration data. It is not code or a
| kernel driver."
| romwell wrote:
| _> I didn't see people explaining why this was
| ridiculous._
|
| Because of how it affects priorities and incentives.
|
| E.g.: as of 2024, CrowdStrike didn't implement staggered
| rollout of Rapid Response content. If you spend a second
| thinking why that's the case, you'll realize that _rapid_
| and _staggered_ are literally antithetical.
|
| _> CrowdStrike say of the update that caused the crash:
| "This Rapid Response Content is stored in a proprietary
| binary file that contains configuration data. It is not
| code or a kernel driver."_
|
| Well, they are lying.
|
| The data that you feed into an _interpreter_ is code, no
| matter what they want to call it.
| GoblinSlayer wrote:
| It's not your kid, so "improve health" is the industry
| standard response here.
| romwell wrote:
| True, but the question is why they can keep getting away
| with that.
| TheFragenTaken wrote:
| What validates the Content Validator? A Content Validator
| Validator?
| citrin_ru wrote:
| > fixing whatever made it possible for "problematic content" to
| cause "ungraceful" crashes
|
| Better not only fix this specific bug but continuously use
| fuzzing to find more places where external data (including
| updates) can trigger a crash (or worse RCE)
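|
| As a concrete example, a minimal libFuzzer-style harness
| (the toy parser below is invented; the real channel-file
| parser is not public):
|
|     #include <cstddef>
|     #include <cstdint>
|
|     // Toy stand-in: length-prefixed payload, rejected if the
|     // declared length runs past the end of the buffer.
|     static bool parse_channel_file(const uint8_t* d, size_t n) {
|         if (n < 4) return false;
|         size_t len = d[0] | (d[1] << 8) | (d[2] << 16) |
|                      (size_t(d[3]) << 24);
|         if (len > n - 4) return false;  // the check to exercise
|         for (size_t i = 0; i < len; ++i) (void)d[4 + i];
|         return true;
|     }
|
|     // Build with: clang++ -fsanitize=fuzzer,address harness.cpp
|     extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data,
|                                           size_t size) {
|         parse_channel_file(data, size);
|         return 0;
|     }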
| romwell wrote:
| That is indeed necessary.
|
| But it seems to me that putting the interpreter in a place in
| the OS where it _can_ cause a system crash, with that being
| behavior it's allowed to do, is a fundamental design choice
| that is not at all addressed by fuzzing.
| cratermoon wrote:
| An interpreter that handles data downloaded from the
| internet even. That's an exploit waiting to happen.
| acdha wrote:
| Also "using memory safe languages for critical components" and
| "detecting failures to load and automatically using the last-
| known-good configuration"
| romwell wrote:
| Copying my content from the duplicate thread[1] here:
|
| This reads like a bunch of baloney to obscure the real problem.
| The only relevant part you need to see:
|
| _> Due to a bug in the Content Validator, one of the two
| Template Instances passed validation despite containing
| problematic content data_.
|
| _Problematic content_? Yeah, this is telling exactly nothing.
|
| Their mitigation is "ummm we'll test more and maybe not roll the
| updates to everyone at once", without any direct explanation on
| how that would prevent this from happening again.
|
| Conspicuously absent:
|
| -- fixing whatever produced "problematic content"
|
| -- fixing whatever made it possible for "problematic content" to
| cause "ungraceful" crashes
|
| -- rewriting code so that the Validator and Interpreter would use
| the _same_ code path to catch such issues in test
|
| -- allowing the sysadmins to roll back updates before the OS
| boots
|
| -- diversifying the test environment to include _actual_ client
| machine configurations running _actual_ releases _as they would
| be received by clients_
|
| This is a nothing sandwich, not an incident review.
|
| [1] https://news.ycombinator.com/item?id=41053703
| dang wrote:
| > Copying my content from the duplicate thread[1] here
|
| Please don't do this! It makes merging threads a pain because
| then we have to find the duplicate subthreads (i.e. your two
| comments) and merge the replies as well.
|
| Instead, if you or anyone will let us know at
| hn@ycombinator.com which threads need merging, we can do that.
| The solution is deduplication, not further duplication!
| rurban wrote:
| They bypassed the tests and staged deployment, because their
| previous update looked good. Ha.
|
| What if they implemented a release process, and followed it?
| Like everyone else does. Hackers at the workplace, sigh.
| CommanderData wrote:
| They know better obviously, transcending process and
| bureaucracy.
| fulafel wrote:
| Also it must have been a manual testing effort, otherwise there
| would be no motive to skip it. IOW, missing test automation.
| throwaway7ahgb wrote:
| Where do you see that? It looks like there was a bug in the
| template tester. Or do you mean the manual tests?
| kasabali wrote:
| > Based on the testing performed before the initial
| deployment of the Template Type (on March 05, 2024), trust in
| the checks performed in the Content Validator, and previous
| successful IPC Template Instance deployments, these instances
| were deployed into production.
| lopkeny12ko wrote:
| > When received by the sensor and loaded into the Content
| Interpreter, problematic content in Channel File 291 resulted in
| an out-of-bounds memory read triggering an exception. This
| unexpected exception could not be gracefully handled, resulting
| in a Windows operating system crash (BSOD).
|
| There was a popular X thread [1] that a lot of people took issue
| with over the past week, but it hit the nail on the head for the
| root cause. I suspect a lot of HNers who criticized him now owe
| him an apology.
|
| [1] https://x.com/Perpetualmaniac/status/1814376668095754753
| nemetroid wrote:
| No, the Twitter poster is still wrong.
| bdjsiqoocwk wrote:
| It's called Twitter.
| joenot443 wrote:
| No, the name's been changed.
| bdjsiqoocwk wrote:
| No it hasn't.
| justusthane wrote:
| I'm not a fan of Musk or of the-platform-formerly-known-
| as-Twitter, but I'm not sure how you can insist that the
| name hasn't been changed.
| mc32 wrote:
| How can these companies be certified and compliant, etc., and
| then in practice have horrible SDLC?
|
| What was the impact of diverse teams (offshoring)? Often
| companies don't have the necessary checks to ensure that the
| disparateness of teams does not impact quality. Maybe it was
| zero, or maybe it was more.
| hulitu wrote:
| > How can these companies be certified and compliant, etc.,
| and then in practice have horrible SDLC?
|
| Checklists?
| hello_moto wrote:
| You're saying there exists a complex software system without a
| bug, despite following best practices to the dot and being
| certified + compliant?
| mc32 wrote:
| No, but their release process should catch major bugs such
| as this. After internal QA, you release to a small internal
| dev team, then to select members of other depts willing to
| dog-food it, then to limited external partners, then GA? Or
| something like that so that you have multiple opportunities
| to catch weird software/hardware interactions before
| bringing down business critical systems for major and small
| companies around the planet?
| hello_moto wrote:
| > After internal QA, you release to small internal dev
| team, then to select members of other depts willing to
| dog-food it, then limited external partners then GA
|
| What about an AV definition update for a 0day swimming in the
| tubes right now?
| mc32 wrote:
| Sure, those have happened before, but nothing with an
| impact like last weekend. That's inexcusable. At least
| definitions can update themselves out of trouble.
| hello_moto wrote:
| What do you refer to by "those have happened before"?
|
| Isn't that what happened? Not a software update, not an
| AV-definition update but more so an AV-definition "data"
| update. At least that's how I interpret "Rapid Response
| Content"
| YZF wrote:
| Standards generally don't mandate specifics and almost
| certainly nothing specific to SDLC. At least none I've heard
| of. Things like FIPS and ISO and SOC2 generally prescribe
| having a certain process; sometimes they can mandate some
| specifics (e.g. what ciphers for FIPS). Maybe there should be
| some release process standards that prescribe how this is
| done, but I'm not aware of any. I think part of the problem is
| that the standard bodies don't really know what to prescribe;
| this sort of has to come from the community. Maybe not unlike
| the historical development of other engineering professions.
| Today being compliant with FIPS doesn't really mean you're
| secure and being SOC2 compliant doesn't really mean customer
| data is safe etc. It's more some sort of minimal bar in
| certain areas of practice and process.
| mc32 wrote:
| Sadly, I agree with your take. All it is is a minimum bar.
| Many who don't have the above are even worse --tho not
| necessarily, but as a rule probably yes.
| cangencer wrote:
| The thread is still wrong, since it was an OOB memory read, not
| a missing null pointer check as claimed. 0x9c is likely the
| value that just happened to be in the OOB read.
| fulafel wrote:
| "Incoming data triggered a out-of-bound memory access bug" is
| hardly a useful conclusion for a root cause investigation (even
| if you are of the faith of the single root cause).
| maples37 wrote:
| https://threadreaderapp.com/thread/1814376668095754753.html
| cataflam wrote:
| Not really, that thread showed only superficial knowledge and
| analysis, far from hitting the nail on the head, for anyone
| used to assembly/reverse engineering. It then goes on to make
| provably wrong assumptions and comments. There is actually a
| null check (2 even!) just before trying the memory access. The
| root cause is likely trying to access an address that's coming
| from some uninitialized or wrongly initialized or non-
| deterministically initialized array.
|
| What it did well was explaining the basics nicely for a wide
| audience who knows nothing about a crash dump or invalid memory
| access, which I guess made the post popular. Good enough for a
| general public explanation, but doesn't pass the bar for an
| actual technical one to any useful degree.
|
| I humbly concur with Tavis' take
|
| https://x.com/taviso/status/1814762302337654829
|
| Here are some others for more technically correct details:
|
| - https://x.com/patrickwardle/status/1814343502886477857
|
| - https://x.com/tweetingjose/status/1814785062266937588
| nodesocket wrote:
| Why do they insist on using what sounds like military
| pseudo-jargon throughout the document?
|
| e.g. sensors? I mean how about hosts, machines, clients?
| com wrote:
| It's endemic in the tech security industry - they've been
| mentally colonised by ex-mil and ex-law enforcement (wannabe
| mil) folks for a long time.
|
| I try to use social work terms and principles in professional
| settings, which blows these people's minds.
|
| Advocacy, capacity evaluation, community engagement, cultural
| competencies, duty of care, ethics, evidence-based
| intervention, incentives, macro-, mezzo- and micro-practice,
| minimisation of harm, respect, self concept, self control etc
| etc
|
| It means that my teams aren't focussed on "nuking the bad guys
| from orbit" or whatever, but building defence in depth and
| indeed our own communities of practice (hah!), and using
| psychological and social lenses as well as tech and adversarial
| ones to predict, prevent and address disruptive and dangerous
| actors.
|
| YMMV though.
| phaedrus wrote:
| Even computer security itself is a metaphor (at least in its
| inception). I often wonder what if instead of using terms
| like access, key, illegal operation, firewall, etc. we'd
| instead chosen metaphors from a different domain, for example
| plumbing. I'm sure a plumbing metaphor could also be found
| for every computer security concern. Would we be so quick to
| romanticize as well as militarize a field dealing with
| "leaks," "blockages," "illegal taps," and "water quality"?
| com wrote:
| "Fatbergs" expresses some things delivered by some teams
| very eloquently for me!
| notepad0x90 wrote:
| Because those things are different? I didn't see a single piece
| of "military" jargon. There is absolutely nothing unusual about
| their wording. It's like someone saying "why do these people
| use such nerdy words" regarding HN content.
| justusthane wrote:
| The sensor isn't a host, machine, or a client. It's the
| software component that detects threats. I guess maybe you
| could call it an agent instead, but I think sensor is pretty
| accepted terminology in the EDR space - it's not specific to
| Crowdstrike.
| coremoff wrote:
| Such a disingenuous review; waffle and distraction to hide the
| important bits (or rather bit: bug in content validator) behind a
| wall of text that few people are going to finish.
|
| If this is how they are going to publish what happened, I don't
| have any hope that they've actually learned anything from this
| event.
|
| > Throughout this PIR, we have used generalized terminology to
| describe the Falcon platform for improved readability
|
| Translation: we've filled this PIR with technobabble so that
| when you don't understand it you won't ask questions for fear of
| appearing slow.
| notepad0x90 wrote:
| > "behind a wall of text that few people are going to finish."
|
| heh? it's not that long and very readable.
| coremoff wrote:
| I disagree; it's much longer than it needs to be, is filled
| with pseudo-technoese to hide that there's little of
| consequence in there, and the tiny bit of real information in
| there is couched with distractions and unnecessary detail.
|
| As I understand it, they're telling us that the outage was
| caused by an unspecified bug in the "Content Validator", and
| that the file that was shipped was done so without testing
| because it worked fine last time.
|
| I think they wrote what they did because they couldn't
| publish the above directly without being rightly excoriated
| for it, and at least this way a lot of the people reading it
| won't understand what they're saying but it sounds very
| technical.
| notepad0x90 wrote:
| No, it's one of the most well-written PIRs I've seen. It
| establishes terms and procedures after communicating that
| this isn't an RCA, then details the timeline of tests and
| deployments and what went wrong. They were neither excessively
| verbose nor terse. This is the right way of communicating to
| the intended audience: technical people, executives and
| lawmakers alike will be reading this. They communicated their
| findings clearly, without code, screenshots, excessive
| historical detail or other distractions.
| hello_moto wrote:
| In the current situation, it's better to be complete, no?
|
| This information is not just for _you_.
| CommanderData wrote:
| "We didn't properly test our update."
|
| Should be the tldr. On threads there's information about
| CrowdStrike slashing QA team numbers; whether that was a factor
| should be looked at.
| hulitu wrote:
| They write perfect software. Why should they test it ? /s
| Ukv wrote:
| A summary, to my understanding:
|
| * Their software reads config files to determine which behavior
| to monitor/block
|
| * A "problematic" config file made it through automatic
| validation checks "due to a bug in the Content Validator"
|
| * Further testing of the file was skipped because of "trust in
| the checks performed in the Content Validator" and successful
| tests of previous versions
|
| * The config file causes their software to perform an out-of-
| bounds memory read, which it does not handle gracefully
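|
| A rough illustration of that last point (hypothetical layout,
| not theirs): any index taken from config data has to be
| bounds-checked before use, because in kernel mode a bad read
| is a bugcheck (BSOD) rather than a catchable error.
|
|     #include <cstddef>
|     #include <cstdint>
|     #include <optional>
|     #include <vector>
|
|     // Graceful handling: reject an out-of-range index from
|     // the config instead of dereferencing it.
|     std::optional<uint32_t> read_entry(
|         const std::vector<uint32_t>& table, size_t idx) {
|         if (idx >= table.size())
|             return std::nullopt;  // reject, keep last-known-good
|         return table[idx];
|     }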
| Narretz wrote:
| * Further testing of the file was skipped because of "trust in
| the checks performed in the Content Validator" and successful
| tests of previous versions
|
| that's crazy. How costly can it be to test the file fully in a
| CI job? I fail to see how this wasn't implemented already.
| modestygrime wrote:
| Just reeks of incompetence. Do they not have e2e smoketests
| of this stuff?
| denton-scratch wrote:
| > How costly can it be to test the file fully in a CI job?
|
| It didn't need a CI job. It just needed one person to
| actually boot and run a Windows instance with the Crowdstrike
| software installed: a smoke test.
|
| TFA is mostly an irrelevant discourse on the product
| architecture, stuffed with proprietary Crowdstrike jargon,
| with about a couple of paragraphs dedicated to the actual
| problem; and they don't mention the non-existence of a smoke
| test.
|
| To me, TFA is _not_ a signal that Crowdstrike has a plan to
| remediate the problem, yet.
| hrpnk wrote:
| They mentioned they do dogfooding. Wonder why it did not
| work for this update.
| xh-dude wrote:
| They discuss dogfooding "Sensor Content", which isn't
| "Rapid Response Content".
|
| Overall the way this is written up suggests some cultural
| problems.
| stefan_ wrote:
| You just got tricked by this dishonest article. The whole
| section that mentions dogfooding is only about actual
| updates to the kernel driver. This was not a kernel
| driver update, the entire section is irrelevant.
| red2awn wrote:
| > How Do We Prevent This From Happening Again?
|
| > Software Resiliency and Testing
|
| > * Improve Rapid Response Content testing by using testing types
| such as:
|
| > * Local developer testing
|
| So no one actually tested the changes before deploying?!
| Narretz wrote:
| And why is it "local developer testing" and not CI/CD? This
| makes them look like absolute amateurs.
| belter wrote:
| > This makes them look like absolute amateurs.
|
| This also applies to all the Architects and CTOs at these
| Fortune 500 companies who allowed these self-updating systems
| into their critical systems.
|
| I would offer a copy of Antifragile to each of these teams:
| https://en.wikipedia.org/wiki/Antifragile_(book)
|
| "Every captain goes down with every ship"
| acdha wrote:
| Architects likely do not have a choice. These things are
| driven by auditors and requirements for things like
| insurance or PCI and it's expensive to protest those. I
| know people who've gone full serverless just to lop off the
| branches of the audit tree about general purpose server
| operating systems, and now I'm wondering whether anyone is
| thinking about iOS/ChromeOS for the same reason.
|
| The more successful path here is probably demanding proof
| of a decent SDLC, use of memory-safe languages, etc. in
| contract language.
| belter wrote:
| > Architects likely do not have a choice.
|
| Architects don't have a choice, CTOs are well paid to golf with
| the CEO and _delegate_ to their teams, Auditors just audit but
| are not involved with the technical implementations, Developers
| just develop according to the Spec, and the Security team is
| just a pain in the ass. Nobody owns it...
|
| Everybody gets well paid, and at the end we have to get lessons
| learned... It's a s*&^&t show...
| mardifoufs wrote:
| Some industries are forced by regulation or liability to
| have something like crowdstrike deployed on their
| systems. And crowdstrike doesn't have a lot of
| alternatives that tick as many checkboxes and are as
| widely recognized.
| belter wrote:
| Please give me an example of that _specific_ regulation.
| hello_moto wrote:
| Seems like everyone thinks that Execs play golf with other
| Execs to seal the deal regardless of how b0rken the system is.
|
| The CTO's job is on the line if the system can't meet the
| requirement, more so if the system is fucked.
|
| To think that every CTO is a dumbass is like saying
| "everyone is stupid, except me, of course"
| belter wrote:
| Not all CTOs... but you just saw hundreds of companies who
| could do better...
| hello_moto wrote:
| That is true, hundreds of companies have no backup process in
| place :D
| RaftPeople wrote:
| The fact that they even listed "local developer testing" is
| pretty weird.
|
| That is just part of the basic process and is hardly the
| thing that ensures a problem like this doesn't happen.
| radicaldreamer wrote:
| They don't care, CI/CD, like QA, is considered a cost center
| for some of these companies. The cheapest thing for them is
| to offload the burden of testing every configuration onto the
| developer, who is also going to be tasked with shipping as
| quickly as possible or getting canned.
|
| Claw back executive pay, stock, and bonuses imo and you'll
| see funded QA and CI teams.
| hyperpape wrote:
| It sure sounds like the "Content Validator" they mention is a
| form of CI/CD. The problem is that the file passed that
| validation but was still capable of failing in reality.
| spacebanana7 wrote:
| This also becomes a security issue at some point. If these
| updates can go in untested, what's to stop a rogue employee
| from deliberately pushing a malicious update?
|
| I know insider threats are very hard to protect against in
| general but these companies must be the most juicy target for
| state actors. Imagine what you could do with kernel space code
| in emergency services, transport infrastructure and banks.
| nine_zeros wrote:
| Will managers continue to push engineers even when the
| engineers advise going slower, or not?
| bobwaycott wrote:
| Always.
| Cyphase wrote:
| Lots of words about improving testing of the Rapid Response
| Content, very little about "the sensor client should not ever
| count on the Rapid Response Content being well-formed to avoid
| crashes".
|
| > Enhance existing error handling in the Content Interpreter.
|
| That's it.
|
| Also, it sounds like they might have separate "validation" code,
| based on this; why is "deploy it in a realistic test fleet" not
| part of validation? I notice they haven't yet explained anything
| about what the Content Validator does to validate the content.
|
| > Add additional validation checks to the Content Validator for
| Rapid Response Content. A new check is in process to guard
| against this type of problematic content from being deployed in
| the future.
|
| Could it say any less? I hope the new check is a test fleet.
|
| But let's go back to, "the sensor client should not ever count on
| the Rapid Response Content being well-formed to avoid crashes".
| hun3 wrote:
| Is error handling enough? A perfectly valid rule file could
| hang (but not outright crash) the system, for example.
| throwanem wrote:
| If the rules are Turing-complete, then sure. I don't see enough
| in the report to tell one way or another; the way the rules are
| made to sound, as if they just fill in templates, about equally
| suggests either possibility (if templates may reference other
| templates), and there is not a lot more detail. Halting seems
| relatively
| easy to manage with something like a watchdog timer, though,
| compared to a sound, crash- and memory-safe* parser for a
| whole programming language, especially if that language
| exists more or less by accident. (Again, no claim; there's
| not enough available detail.)
|
| I would not want to do any of this directly on metal, where
| the only safety is what you make for yourself. But that's the
| line Crowdstrike are in.
|
| * By EDR standards, at least, where "only" one reboot a week
| forced entirely by memory lost to an unkillable process
| counts as exceptionally _good._
| ReaLNero wrote:
| Perhaps set a timeout on the operation then? Given this is
| kernel code it's not as easy as userspace, but I'm sure you
| could request to set an interrupt on a timer.
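|
| A userspace-flavoured sketch of that idea (a kernel version
| would use OS timer facilities; all names invented): evaluation
| stops at a deadline instead of hanging the host on a
| pathological rule set.
|
|     #include <chrono>
|     #include <functional>
|     #include <vector>
|
|     bool evaluate_rules(
|         const std::vector<std::function<void()>>& rules,
|         std::chrono::milliseconds budget) {
|         auto deadline =
|             std::chrono::steady_clock::now() + budget;
|         for (const auto& rule : rules) {
|             if (std::chrono::steady_clock::now() > deadline)
|                 return false;  // give up, report a timeout
|             rule();
|         }
|         return true;
|     }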
| SoftTalker wrote:
| > it sounds like they might have separate "validation" code
|
| That's what stood out to me. From the CS post: "Template
| Instances are created and configured through the use of the
| Content Configuration System, which includes the Content
| Validator that performs validation checks on the content before
| it is published."
|
| Lesson learned: a "Validator" that is not actually the _same
| program_ that will be parsing/reading the file in production is
| not a complete test. It's not entirely useless, but it doesn't
| guarantee anything. The production program could have a latent
| bug that a completely "valid" (by specification) file might
| trigger.
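|
| A sketch of that lesson (names invented): let pre-publish
| validation call the exact parse routine the sensor runs, so
| anything the validator blesses has already been through the
| production code path once.
|
|     #include <cstdint>
|     #include <string>
|     #include <vector>
|
|     struct ParseResult { bool ok; std::string error; };
|
|     // The one and only parser, shared by sensor and validator.
|     ParseResult parse_content(const std::vector<uint8_t>& b) {
|         if (b.empty() || b[0] != 0xAA)   // toy format check
|             return {false, "bad header"};
|         return {true, ""};
|     }
|
|     bool validate_before_publish(const std::vector<uint8_t>& b) {
|         return parse_content(b).ok;  // no divergent second parser
|     }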
| modestygrime wrote:
| I'd argue that it is completely useless. They have the actual
| parser that runs in production and then a separate "test
| parser" that doesn't actually reflect reality? Why?
| Cyphase wrote:
| Direct link to the PIR, instead of the list of posts:
| https://www.crowdstrike.com/blog/falcon-content-update-preli...
| Cyphase wrote:
| The article link has been updated to that; it used to be the
| "hub" page at https://www.crowdstrike.com/falcon-content-
| update-remediatio...
|
| Some updates from the hub page:
|
| They published an "executive summary" in PDF format:
| https://www.crowdstrike.com/wp-content/uploads/2024/07/Crowd...
|
| That includes a couple of bullet points under "Third Party
| Validation" (independent code/process reviews), which they
| added to the PIR on the hub page, but not on the dedicated PIR
| page.
|
| > Updated 2024-07-24 2217 UTC
|
| > ### Third Party Validation
|
| > - Conduct multiple independent third-party security code
| reviews.
|
| > - Conduct independent reviews of end-to-end quality processes
| from development through deployment.
| squirrel wrote:
| There's only one sentence that matters:
|
| "Provide customers with greater control over the delivery of
| Rapid Response Content updates by allowing granular selection of
| when and where these updates are deployed."
|
| This is where they admit that:
|
| 1. They deployed changes to their software directly to customer
| production machines;
|
| 2. They didn't allow their clients any opportunity to test those
| changes before they took effect; and
|
| 3. This was cosmically stupid and they're going to stop doing
| that.
|
| Software that does 1. and 2. has absolutely no place in critical
| infrastructure like hospitals and emergency services. I predict
| we'll see other vendors removing similar bonehead "features" very
| very quietly over the next few months.
| hello_moto wrote:
| > I predict we'll see other vendors removing similar bonehead
| "features" very very quietly over the next few months.
|
| Absolutely this is what will happen.
|
| I don't know much about the practice of AV definition-like
| features across cybersecurity, but I would imagine there might
| be a possibility that no vendors do rolling updates today
| because it involves opt-in/opt-out, which might influence the
| vendor's speed to identify an attack, which in turn affects
| their "Reputation" as well.
|
| "I bought Vendor-A solution but I got hacked and have to pay
| Ransomware" (with a side note: because I did not consume the
| latest critical update of AV definition) is what Vendors
| worried.
|
| Now that this Global Outage happened, it will change the
| landscape a bit.
| XlA5vEKsMISoIln wrote:
| >Now that this Global Outage happened, it will change the
| landscape a bit.
|
| I seriously doubt that. Questions like "why should we use
| CrowdStrike" will be met with "suppose they've learned their
| lesson".
| mr_mitm wrote:
| Does anyone test their antivirus updates individually as a
| customer? I thought they happen multiple times a day, who has
| time for that?
| packetlost wrote:
| Yes? Not consumers typically, but many IT departments with
| certain risk profiles absolutely do.
| packetlost wrote:
| I really wish we would get some regulation as a result of this.
| I know people that almost died due to hospitals being down. It
| should be absolutely mandatory for users, IT departments, etc.
| to be able to control when and where updates happen on their
| infrastructure but *especially* so for critical infrastructure.
| SketchySeaBeast wrote:
| Unfortunately, putting the onus on risk-averse organizations
| like hospitals and governments to validate the AV changes means
| they just won't get pushed and will be chronically exposed.
|
| That said, maybe Crowdstrike should consider validating every
| step of the delivery pipeline before pushing to customers.
| throw0101d wrote:
| > _Unfortunately, putting the onus on risk-averse
| organizations like hospitals and governments to validate the
| AV changes means they just won't get pushed and will be
| chronically exposed._
|
| I have a similar feeling.
|
| At the very least, perhaps have an "A" and a "B" update
| channel, where "B" is _x_ hours behind A. This way if, in an HA
| configuration, one side goes down, there's time to deal with it
| while your B-side is still up.
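|
| One way to express that (purely illustrative, not an actual
| CrowdStrike or Falcon setting): the "B" channel only becomes
| eligible for a release once it has aged some hours on "A".
|
|     #include <chrono>
|     #include <string>
|
|     struct UpdateChannel {
|         std::string name;
|         std::chrono::hours lag;  // 0h for "A", e.g. 24h for "B"
|     };
|
|     // A release reaches a channel only after it has aged past
|     // that channel's lag.
|     bool eligible(const UpdateChannel& ch,
|                   std::chrono::system_clock::time_point released,
|                   std::chrono::system_clock::time_point now) {
|         return now - released >= ch.lag;
|     }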
| dmazzoni wrote:
| Why can't they just do it more like Microsoft security
| patches, making them mandatory but giving admins control over
| when they're deployed?
| XlA5vEKsMISoIln wrote:
| That would be equivalent to asking "would you prefer your
| fleet to bluescreen now, or later" in this case.
| jaggederest wrote:
| Presumably you could roll out to 1% and report issues
| back to the vendor before the update was applied to the
| last 99%. So a headache but not "stop the world and
| reboot" levels of hassle.
| bawolff wrote:
| > They deployed changes to their software directly to customer
| production machines; 2. They didn't allow their clients any
| opportunity to test those changes before they took effect; and
| 3. This was cosmically stupid and they're going to stop doing
| that.
|
| Is it really all that surprising? This is basically their
| business model - it's a fancy virus scanner that is supposed to
| instantly respond to threats.
| 98codes wrote:
| Combined with this, presented as a change they could
| _potentially_ make, it's a killer:
|
| > Implement a staggered deployment strategy for Rapid Response
| Content in which updates are gradually deployed to larger
| portions of the sensor base, starting with a canary deployment.
|
| They weren't doing any test deployments at all before blasting
| the world with an update? Reckless.
| nathanlied wrote:
| >I predict we'll see other vendors removing similar bonehead
| "features" very very quietly over the next few months.
|
| If indeed this happens, I'd hail this event as a victory
| overall; but industry experience tells me that most of those
| companies will say "it'd never happen with us, we're a lot more
| careful", and keep doing what they're doing.
| brianmback wrote:
| The only thing worse than the Crowdstrike incident is the UI they
| used to publish this report
| duped wrote:
| Here is my summary with the marketing bullshit ripped out.
|
| Falcon configuration is shipped with both direct driver updates
| ("sensor content"), and out of band ("rapid response content").
| "Sensor Content" are scripts (*) that ship with the driver.
| "Rapid response content" are data that can be delivered
| dynamically.
|
| One way that "Rapid Response Content" is implemented is with
| templated "Sensor Content" scripts. CrowdStrike can keep the
| behavior the same but adjust the parameters by shipping "channel"
| files that fill in the templates.
|
| "Sensor content", including the templates, are a part of the
| normal test and release process and goes through
| testing/verification before being signed/shipped. Customers have
| control over rollouts and testing.
|
| "Rapid Response Content" is deployed through a different channel
| that customers do not have control over. Crowdstrike shipped a
| broken channel file that passed validation but was not tested.
|
| They are going to fix this by adding testing of "rapid response"
| content updates and support the same rollout logic they do for
| the driver itself.
|
| (*) I'm using the word "script" here loosely. I don't know what
| these things are, but they sound like scripts.
|
| ---
|
| In other words, they have scripts that would crash given garbage
| arguments. The validator is supposed to check this before they
| ship, but the validator screwed it up (why is this a part of
| release and not done at runtime? (!)). It appears they did not
| test it, they do not do canary deployments or support rollout of
| these changes, and everything broke.
|
| Corrupting these channel files sounds like a promising way to
| attack CS, I wonder if anyone is going down that road.
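|
| A rough data-model sketch of the Sensor Content / Rapid
| Response split described above (every name below is invented):
| the detection logic ships with the signed driver as a Template
| Type, and a channel file only supplies parameter values for
| instances of it.
|
|     #include <map>
|     #include <string>
|     #include <vector>
|
|     // Ships with the driver ("Sensor Content"): behaviour plus
|     // the named parameters it accepts.
|     struct TemplateType {
|         std::string name;  // e.g. "ipc_named_pipe" (made up)
|         std::vector<std::string> param_names;
|     };
|
|     // Delivered out of band in a channel file ("Rapid Response
|     // Content"): values for an existing Template Type, no new
|     // driver code.
|     struct TemplateInstance {
|         std::string template_type;
|         std::map<std::string, std::string> params;
|     };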
| hello_moto wrote:
| > Corrupting these channel files sounds like a promising way to
| attack CS, I wonder if anyone is going down that road.
|
| It would have happened a long time ago if it was that easy, no?
| duped wrote:
| How do we know it hasn't?
| hello_moto wrote:
| If it happened, the industry would have known by now.
|
| The group behind it will come out to the public.
| sudosysgen wrote:
| This would be the kind of vulnerability that would be
| worth millions of dollars and used for targeted attacks
| and/or by state actors. It could take years to uncover
| (like Pegasus, which took 5 years to be discovered) or
| never be uncovered at all.
| EricE wrote:
| A file full of zeros is an "undetected error"? Good grief.
| dmazzoni wrote:
| It wasn't a file full of zeros that caused the problem.
|
| While some affected users did have a file full of zeros, that
| was actually a result of the system being in the process of
| downloading an update, and not the version of the file that
| caused the crash.
| jvreeland wrote:
| I really dislike reading websites that take up over half the
| screen and make me read off to the side like this. I can fix it
| by zooming in, but I don't understand why they thought making
| the navigation take up that much of the screen, or not be
| collapsible, was a good move.
| sgammon wrote:
| 1) Everything went mostly well
|
| 2) The things that did not fail went so great
|
| 3) Many many machines did not fail
|
| 4) macOS and Linux unaffected
|
| 5) Small lil bug in the content verifier
|
| 6) Please enjoy this $10 gift card
|
| 7) Every windows machine on earth bsod'd but many things worked
| mikequinlan wrote:
| Regarding the gift card, TechCrunch says
|
| "On Wednesday, some of the people who posted about the gift
| card said that when they went to redeem the offer, they got an
| error message saying the voucher had been canceled. When
| TechCrunch checked the voucher, the Uber Eats page provided an
| error message that said the gift card "has been canceled by the
| issuing party and is no longer valid.""
|
| https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-ap...
| sgammon wrote:
| There's a KB up about this now. To use your voucher, reboot
| into safe mode and...
| mikequinlan wrote:
| On another forum a person replied...
|
| >The system to redeem the card is probably stuck in a boot
| loop
| rm445 wrote:
| Fun post, but I'll state the obvious because I think many
| people do believe that every Windows machine BSOD'd. It was
| only ones with Crowdstrike software. Which is apparently very
| common but isn't actually pre-installed by Microsoft in
| Windows, or anything like that.
|
| Source: work in a Windows shop and had a normal day.
| sgammon wrote:
| True, and definitely worth a mention. This is only
| Microsoft's fault insofar as it was possible at all to crash
| this way, this broadly, with so little recourse via remote
| tooling.
| cataflam wrote:
| Besides missing the actual testing (!) and the staged rollout
| (!), it looks like they also weren't fuzzing this kernel driver
| that routinely takes instant worldwide updates. Oops.
| l00tr wrote:
| check their developer github, "i write kernel-safe bytecode
| interpreters" :D, https://github.com/bushidocodes/bushidocodes
| brcmthrowaway wrote:
| He Codes With Honor(tm)
| gostsamo wrote:
| Cowards. Why don't you just stand up and admit that you didn't
| bother testing everything you send to production?
|
| Everything else is smoke and the smell of sulfur.
| aeyes wrote:
| Do you see how they only talk about technical changes to prevent
| this from happening again?
|
| To me this was a complete failure on the process and review side.
| If something so blatantly obvious can slip through, how could I
| ever trust them to prevent an insider from shipping a backdoor?
|
| They are auto-updating code with the highest privileges on
| millions of machines. I'd expect their processes to be much,
| much more cautious.
| m3kw9 wrote:
| Still have kernel access
| anonu wrote:
| In my experience with outages, usually the problem lies in some
| human error not following the process: Someone didn't do
| something, checks weren't performed, code reviews were skipped,
| someone got lazy.
|
| In this post mortem there are a lot of words, but not one of
| them actually explains what the problem was, which is: what was
| the process in place and why did it fail?
|
| They also say a "bug in the content validation". Like what kind
| of bug? Could it have been prevented with proper testing or code
| review?
| 1970-01-01 wrote:
| >When received by the sensor and loaded into the Content
| Interpreter, problematic content in Channel File 291 resulted in
| an out-of-bounds memory read triggering an exception.
|
| Wasn't 'Channel File 291' a garbage file filled with null
| pointers? Meaning it's problematic content in the same way as
| filling your parachute bag with ice cream and screws is
| problematic.
| hyperpape wrote:
| They specifically denied that null bytes were the issue in an
| earlier update. https://www.crowdstrike.com/blog/falcon-update-
| for-windows-h...
| 1970-01-01 wrote:
| Null pointers, not a null array
___________________________________________________________________
(page generated 2024-07-24 23:06 UTC)