[HN Gopher] The CrowdStrike file that broke everything was full ...
___________________________________________________________________
The CrowdStrike file that broke everything was full of null
characters
Author : behnamoh
Score : 240 points
Date : 2024-07-19 18:47 UTC (4 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| TillE wrote:
| That probably explains how it got past internal testing.
| Something went wrong after that, during deployment.
| behnamoh wrote:
| It could be as simple as cosmic radiation that flipped a bit
| (it has happened before:
| https://www.independent.co.uk/news/science/subatomic-
| particl...), or as sophisticated as an adversarial hacking.
| rvnx wrote:
| The same cosmic radiation that flips the bits to make some
| specific political party win.
| zimpenfish wrote:
| Someone on the Fediverse conjectured that it might have been
| down to the Azure glitch earlier in the day. An empty file
| would fit that if they weren't doing proper error checking on
| their downloads, etc.
| homero wrote:
| It's crazy if they weren't signing and verifying downloads
| janice1999 wrote:
| It's still crazy that a security tool does not validate content
| files it loads from disk that get regularly updated. Clearly
| fuzzing was not a priority either.
| Zigurd wrote:
| How many years has this Crowdstrike code been running without
| issues? You have put your finger on it: Fuzzing should have
| been part of a test plan. Even TDD isn't a bastard test
| engineer writing tests that probe edge cases. Even observing
| that your unit tests have good code coverage isn't a
| substitute for fuzzing. There is even a counter-argument that
| something that has been reliable in the field should not be fixed
| for reasons like failing a test case never seen in real
| deployments, so why go making trouble.
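A minimal sketch of the fuzzing idea raised in this subthread, assuming a made-up file layout. The `DEFS` magic, the record format, and the function names are all hypothetical and are not CrowdStrike's real format or code. The harness feeds edge cases (empty input, a run of null bytes) plus random blobs to a parser and treats anything other than a clean `ValueError` as a failure.

```python
import random
import struct

MAGIC = b"DEFS"  # hypothetical magic value, not a real vendor format


def parse_definitions(blob: bytes) -> list:
    """Parse MAGIC, a uint32 record count, then (uint16 length, payload) records.

    Raises ValueError on malformed input instead of reading out of bounds.
    """
    if len(blob) < 8 or blob[:4] != MAGIC:
        raise ValueError("bad header")
    (count,) = struct.unpack_from("<I", blob, 4)
    offset, records = 8, []
    for _ in range(count):
        if offset + 2 > len(blob):
            raise ValueError("truncated record length")
        (size,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        if offset + size > len(blob):
            raise ValueError("truncated record payload")
        records.append(blob[offset:offset + size])
        offset += size
    if offset != len(blob):
        raise ValueError("trailing bytes")
    return records


def random_blob(rng: random.Random, max_len: int = 64) -> bytes:
    """Return a random byte string shorter than max_len."""
    return bytes(rng.getrandbits(8) for _ in range(rng.randrange(max_len)))


def fuzz(parser, rounds: int = 2000, seed: int = 0) -> int:
    """Feed edge cases and random blobs to parser; return the rejection count.

    Only ValueError counts as a clean rejection. Any other exception
    propagates and fails the run, which is the bug the fuzzer is hunting.
    """
    rng = random.Random(seed)
    corpus = [b"", b"\x00" * 4096, MAGIC, MAGIC + b"\xff" * 4]
    corpus += [MAGIC + random_blob(rng) for _ in range(rounds)]
    corpus += [random_blob(rng) for _ in range(rounds)]
    rejected = 0
    for blob in corpus:
        try:
            parser(blob)
        except ValueError:
            rejected += 1
    return rejected
```

Calling `fuzz(parse_definitions)` in a test suite is enough to exercise the all-zeros case. A purely random harness like this is only a starting point; a coverage-guided fuzzer would explore the parser far more thoroughly.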
| password4321 wrote:
| https://news.ycombinator.com/item?id=41006104#41006555
|
| _the flawed data was added in a post-processing step of the
| configuration update, which is after it's been tested
| internally but before it's copied to their update servers_
| gbin wrote:
| I don't understand, how the signature even worked? Please
| please tell me those drivers are signed... Right? ...
| Retr0id wrote:
| While I won't discount it entirely, I think the people acting
| like this (alone) implies malice are being very silly.
| JumpCrisscross wrote:
| It demonstrates terrible QC.
| loloquwowndueo wrote:
| Don't mess with Quebec :D
| Retr0id wrote:
| Clearly _something_ failed catastrophically, but it could
| well be post-QC
| cjbprime wrote:
| There should be no "post-QC". You do gradual rollout across
| the fleet, while checking your monitoring to ensure the
| fleet hasn't gone down.
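A rough sketch of a rollout that is gated on monitoring, as described in the comment above. The `deploy` and `is_healthy` callables, the stage fractions, and the failure threshold are assumptions for illustration; a real system would also wait for hosts to report in before judging a stage.

```python
from typing import Callable, Sequence


def gradual_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.2, 1.0),
    max_failure_rate: float = 0.05,
) -> list:
    """Deploy to growing fractions of hosts, halting if too many turn unhealthy.

    Returns the hosts that received the update. `deploy` and `is_healthy`
    are hypothetical hooks into a deployment and a monitoring system.
    """
    done = []
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(done):target])
        for host in batch:
            deploy(host)
        done.extend(batch)
        failures = sum(1 for host in batch if not is_healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            break  # stop here: the rest of the fleet keeps the old version
    return done
```

With 100 hosts and an update that breaks every machine it touches, this stops after the first host instead of taking down the whole fleet.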
| Retr0id wrote:
| Non-gradual-rollout updates are an exacerbating factor,
| but it isn't a root cause.
| midtake wrote:
| I disagree. In the current day the stakes are too high to
| naively attribute software flaws to incompetence. We should
| assume malice until it is ruled out, otherwise it will become a
| vector for software implants. These are matters of national
| security at this point.
| Retr0id wrote:
| You can (and should) want to identify the root cause, without
| assuming malice.
| gerdesj wrote:
| "These are matters of national security at this point."
|
| Which nation exactly? Who on earth "wins" by crashing vast
| numbers of PCs worldwide?
|
| Many of the potential foes you might be thinking of are
| unlikely to actually run CS locally but it's bad for business
| if your prey can't even boot their PCs and infra so you can
| scam them.
|
| I might allow for a bunch of old school nihilists getting off
| on this sort of rubbish but it won't last and now an entire
| class of security software, standards and procedures will be
| fixed up. This is no deliberate "killer blow".
|
| Who knew that well meaning security software running in Ring
| 0 could fuck up big style if QA takes a long walk off a short
| plank? Oh, anyone who worked in IT during the '90s and '00s!
| I remember Sophos and McAfee (now Trellix) and probably
| others managing to do something similar, back in the day.
|
| Mono-cultures are subject to pretty catastrophic failures, by
| definition. If you go all in with the same thing as everyone
| else then if they sneeze, you will catch the 'flu too.
| fourteenfour wrote:
| At least it compressed well, which must have saved network
| resources during the update. :)
| sitkack wrote:
| Resources like wall clock time.
| bloopernova wrote:
| Note to self: on Monday, add a null character check to pre-commit
| hooks, and add the same check to pipelines.
| Retr0id wrote:
| It's perfectly normal for binary artifacts to contain null
| bytes, even long runs of them.
| bloopernova wrote:
| Yeah, I'd need to figure it out properly, but for unicode
| text files it should be OK. Good point about the binaries
| though, thank you!
| hawski wrote:
| You say Unicode, but you mean UTF-8. Now for 16 bit Unicode
| the story is different :)
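A small sketch of the null-character check discussed here, taking the UTF-8 versus UTF-16 caveat into account. It is meant for files that are supposed to be text; as noted above, binary artifacts legitimately contain null bytes. The function name is an assumption, and wiring it into a pre-commit hook or pipeline is left out.

```python
import codecs

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def has_unexpected_nulls(data: bytes) -> bool:
    """Return True if data, treated as a text file, contains NUL characters.

    Files starting with a UTF-16 BOM are decoded first, because UTF-16
    stores ASCII text with a zero byte in every code unit.
    """
    if data.startswith(UTF16_BOMS):
        try:
            text = data.decode("utf-16")
        except UnicodeDecodeError:
            return True  # not valid UTF-16: flag it rather than guess
        return "\x00" in text
    return b"\x00" in data
```

A plain `b"\x00" in data` test would reject every UTF-16 file containing ASCII text, which is why the BOM is checked first. UTF-16 files without a BOM are not recognised by this sketch and would be flagged.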
| MilStdJunkie wrote:
| I mentioned it in a separate parent, but null purge is - for
| the stuff I work with - completely non-negotiable. Nulls seem
| to break virtually everything, just by existing. Furthermore,
| old-timey PDFs are chock full of the things, for God knows what
| reason, and a huge amount of data I work with are old-timey
| PDF.
| toast0 wrote:
| > Furthermore, old-timey PDFs are chock full of the things,
| for God knows what reason, and a huge amount of data I work
| with are old-timey PDF.
|
| Probably UCS-2/UTF-16 encoding with ascii data.
| mrguyorama wrote:
| The problem ISN'T the null character though. The problem is
| that they tested the system, THEN changed stuff, then uploaded
| the changed stuff.
|
| Their standard methodology was to deploy untested stuff.
| thrill wrote:
| So ... the checksum of all the Seinfeld episodes?
| geor9e wrote:
| Explain joke please
| themagicteeth wrote:
| Seinfeld is a show about nothing
|
| http://seinfeldscripts.com/ThePitch.htm
| WhyCause wrote:
| Seinfeld was "a show about nothing."
| kristjansson wrote:
| Something like
|
|     zero_output_file(fh, len(file))
|     flush()
|     fill_output_file(fh, data)
|
| with an oops in line 3?
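If the failure really looked like the zero-then-fill pattern guessed at above (which is speculation in this thread), a common way to avoid exposing a half-written file is to write to a temporary file and rename it into place. A minimal sketch, with `atomic_write` as an illustrative name:

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, path)  # rename over the destination
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

According to the Python documentation, a successful `os.replace` is atomic on POSIX systems. If the write fails partway, the destination keeps its previous content instead of a block of zeros.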
| AndrewKemendo wrote:
| This should not have passed a competent C/I pipeline for a system
| in the critical path.
|
| I'm not even particularly stringent when it comes to automated
| test across-the-board but for this level of criticality of
| system, you need exceptionally good state management
|
| To the point where you should not roll to production without an
| integration test on every environment that you claim to support
|
| Like it's insane to me that this size and criticality of a
| company doesn't have a staging or even a development test server
| that tests all of the possible target images that they claim to
| support.
|
| Who is running stuff over there - total incompetence
| martinky24 wrote:
| A lot of assumptions here that probably aren't worth making
| without more info -- For example it could certainly be the case
| that there was a "real" file that worked and the bug was in the
| "upload verified artifact to CDN code" or something, at which
| point it passes a lot of things before the failure.
|
| We don't have the answers, but I'm not in a rush to assume that
| they don't test anything they put out at all on Windows.
| EvanAnderson wrote:
| I haven't seen the file, but surely each build artifact
| should be signed and verified when it's loaded by the client.
| The failure mode of bit rot / malice in the CDN should be
| handled.
| gjsman-1000 wrote:
| Perhaps - but if I made a list of all of the things your
| company _should_ be doing and didn't, or even things that
| your side project _should_ be doing and didn't, or even
| things in your personal life that you _should_ be doing and
| haven't, I'm sure it would be very long.
| EvanAnderson wrote:
| A company deploying kernel-mode code that can render huge
| numbers of machines unusable should have done better.
| It's one of those "you had one job" kind of situations.
|
| They would be a gigantic target for malware. Imagine
| pwning a CDN to pwn millions of client computers. The CDN
| being malicious would be a major threat.
| soraminazuki wrote:
| Oh, they have one job for sure. Selling compliance. All
| else isn't their job, including actual security.
|
| Antiviruses are security cosplay that works by using a
| combination of bug-riddled custom kernel drivers and
| unsandboxed C++ parsers running with the highest level of
| privileges to tamper with every bit of data it can get
| its hands on. They violate every security common sense.
| They also won't even hesitate to disable or delay
| rollouts of actual security mechanisms built into
| browsers and OSes if it gets in the way.
|
| The software industry needs to call out this scam and put
| them out of business sooner than later. This has been the
| case for at least a decade or two and it's sad that
| nothing has changed.
|
| https://ia801200.us.archive.org/1/items/SyScanArchiveInfo
| con... https://robert.ocallahan.org/2017/01/disable-your-
| antivirus-...
| heraldgeezer wrote:
| Nope, I have seen software like Crowdstrike, S1, Huntress
| and Defender E5 stop active ransomware attacks.
| soraminazuki wrote:
| That anecdote doesn't justify installing gaping security
| holes into the kernel with those tools. Actual security
| requires knowledge, good practice, and good engineering.
| Antiviruses can never be a substitute.
| cduzz wrote:
| Which is their "One Job" ?
|
| Options include:
|
| 1. protected systems always work even if things are
| messed up
|
| 2. protected systems are always protected even when
| things are messed up
|
| The two failure modes are exclusive; ideally you let the
| end user decide what to do if the protection mechanism is
| itself unstable.
|
| One could suggest "the system must always work" but
| that's ignoring that sometimes things don't go to plan.
|
| None of the systems in boot loops were p0wned by known
| exploits while they were boot looping. As far as we know
| anyhow.
|
| (edited to add the obvious default of "just make a
| working system" which is of course both a given and not
| going to happen)
| bn-l wrote:
| I think in this case it's reasonable for us to expect
| that they are doing what they _should_ be doing.
| jjav wrote:
| > all of the things your company should be doing and
| didn't
|
| Processes need to match the potential risk.
|
| If your company is doing some inconsequential social app
| or whatever, then sure, go ahead and move fast and break
| things if that's how you roll.
|
| If you are a company, let's call them Crowdstrike, that
| has access to push root privileged code to a significant
| percentage of all machines on the internet, the minimum
| quality bar is vastly higher.
|
| For this type of code, I would expect a comprehensive
| test suite that covers everything and a fleet of QA
| machines representing every possible combination of
| supported hardware and software (yes, possibly thousands
| of machines). A build has to pass that and then get
| rolled into dogfooding usage internally for a while. And
| then very slowly gets pushed to customers, with
| monitoring that nothing seems to be regressing.
|
| Anything short of that is highly irresponsible given the
| access and risk the Crowdstrike code represents.
| Denvercoder9 wrote:
| > A build has to pass that and then get rolled into
| dogfooding usage internally for a while. And then very
| slowly gets pushed to customers, with monitoring that
| nothing seems to be regressing.
|
| That doesn't work in the business they're in. They need
| to roll out definition updates quickly. Their clients
| won't be happy if they get compromised while CrowdStrike
| was still doing the dogfooding or phased rollout of the
| update that would've prevented it.
| xyst wrote:
| Hindsight is 20/20
|
| This is a public company after all. In this market, you
| don't become a "Top-Tier Cybersecurity Company At A Premium
| Valuation" with amazing engineering practices.
|
| Priority is sales, increasing ARR, and shareholders.
| fsloth wrote:
| This is the market. Good engineering practices don't hurt
| but they are not mandatory. If Boeing can wing it so can
| everybody.
| StressedDev wrote:
| Boeing has been losing market share to Airbus for
| decades. That is what happens when you cannot fix your
| problems, sell a safe product, keep costs in line, etc.
| MBCook wrote:
| That's too much of an excuse.
|
| This isn't hindsight. It's "don't blow up 101" level
| stuff they messed up.
|
| It's not that this got past their basic checks, they
| don't appear to have had them.
|
| So let's ask a different question:
|
| The file parser in their kernel extension clearly never
| expected to run into an invalid file, and had no
| protections to prevent it from doing the wrong thing _in
| the kernel_.
|
| How much you want to bet that module could be trivially
| used to do a kernel exploit early in boot if you managed
| to feed it your "update" file?
|
| I bet there's a good pile of 0-days waiting to be found.
|
| And this is _security software_.
|
| This is "we didn't know we were buying rat poison to put
| in the bagels" level dumb.
|
| Not "hindsight is 20/20".
| SoftTalker wrote:
| Truly an "the emperor has no clothes" moment.
| StressedDev wrote:
| Not caring about the actual product will eventually kill
| a company. All companies have to constantly work to
| maintain and grow their customer base. Customers will
| eventually figure out if a company is selling snake oil,
| or a shoddy product.
|
| Also, the tech industry is extremely competitive. Leaders
| frequently become laggards or go out of business. Here
| are some companies who failed or shrank because their
| products could not compete: IBM, Digital Equipment, Sun,
| Borland, Yahoo, Control Data, Lotus (later IBM),
| Evernote, etc. Note all of these companies were at some
| point at the top of their industry. They aren't anymore.
| worik wrote:
| > Not caring about the actual product will eventually
| kill a company.
|
| Eventually
|
| By then the principals are all very rich, and no longer
| care.
|
| Do you think Bill Gates sleeps well?
| geodel wrote:
| Keyword is _eventually_. By then C-level would've been
| retired. Others in top management would've changed
| multiple jobs.
|
| IMO point is not where are these past top companies now
| but where are top people in those companies now. I
| believe they end up being in very comfortable situation
| no matter which place.
|
| Exceptions of course would be criminal prosecution,
| financial frauds etc.
| AdamJacobMuller wrote:
| The file was just full of null bytes.
|
| It's very possible the signature validation and
| verification happens after the bug was triggered.
| wk_end wrote:
| "Load a kernel module and _then_ verify it" is not the
| way any remotely competent engineer would do things.
|
| (...which doesn't rule out the possibility that CS was
| doing it.)
| justinclift wrote:
| The ClownStrike Falcon software that runs on both Linux
| and macOS was _incredibly_ flaky and a constant source of
| kernel problems at my previous work place. We had to push
| back on it regardless of the security team's (strongly
| stated) wishes, just to keep some of the more critical
| servers functional.
|
| Pretty sure "competence" wasn't part of the job
| description of the ClownStrike developers, at least for
| those pieces. :( :( :(
| soraminazuki wrote:
| ClownStrike left kernel panics unfixed for a year until
| macOS deprecated kernel extensions altogether. It was
| scary because crash logs indicated that memory was
| corrupted while processing network packets. It might've
| been exploitable.
| usr1106 wrote:
| Haven't used Windows for close to 15 years, but I read
| the file is (or rather supposed to be) a NT kernel
| driver.
|
| Are those drivers signed? Who can sign them? Only
| Microsoft?
|
| If it's true the file contained nothing but zeros, that
| also seems to be a kernel vulnerability. Even if signing
| were not mandatory, shouldn't the kernel check for some
| structure, symbol tables or the like before
| proceeding?
| dagaci wrote:
| Think more, imagine that your CrowdStrike security
| layer detects an 'unexpected' kernel level data file.
|
| Choice #1 Disable security software and continue.
| Choice #2 Stop. BSOD message, contact your administrator.
|
| There may be nothing wrong with the drivers.
| derefr wrote:
| Choice #3 structure the update code so that verifying the
| integrity of the update (in kernel mode!) is upstream of
| installing the update / removing the previous definitions
| package, such that a failed update (for _whatever_
| reason) results in the definitions remaining in their
| existing pre-update state.
|
| (This is exactly how CPU microcode updates work -- the
| CPU "takes receipt" of the new microcode package, and
| integrity-verifies it internally, before starting to do
| anything involving updating.)
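A sketch of the "verify first, keep the old state on failure" structure described in Choice #3. The file paths, the `validate` callback, and the function name are hypothetical; this is user-space Python for illustration, not kernel code.

```python
import os
from typing import Callable


def install_if_valid(
    candidate_path: str,
    active_path: str,
    validate: Callable[[bytes], bool],
) -> bool:
    """Promote candidate to active only after it validates.

    On any validation failure the active file is left untouched, so the
    previous definitions stay in effect. Returns True if promoted.
    """
    with open(candidate_path, "rb") as fh:
        blob = fh.read()
    try:
        ok = validate(blob)
    except Exception:
        ok = False  # a crashing validator counts as a rejection
    if not ok:
        os.remove(candidate_path)
        return False
    os.replace(candidate_path, active_path)
    return True
```

The boolean result matters: a caller would want to report rejected updates loudly, since silently staying on old definitions is its own risk.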
| warkdarrior wrote:
| > a failed update (for whatever reason) results in the
| definitions remaining in their existing pre-update state
|
| Fantastic solution! You just gave the attackers a way to
| stop all security updates to the system.
| rahkiin wrote:
| The file was data used by the actual driver like some
| virus database. It is not code loaded by the kernel
| poizan42 wrote:
| No the file is not a driver. It's a file loaded by a
| driver, some sort of threat/virus definition file I
| think?
|
| And yes Windows drivers are signed. If it had been a
| driver it would just have failed to load. Nowadays they
| must be signed by Microsoft, see
| https://learn.microsoft.com/en-us/windows-
| hardware/drivers/d...
| MBCook wrote:
| That was my read.
|
| The kernel driver was signed. The file it loaded as input
| with garbage data had seemingly no verification on it at
| all, and it crashed the driver and therefore the kernel.
| usr1106 wrote:
| Hmm, the driver must be signed (by Microsoft I assume).
| So they sign a driver which in turn loads unsigned files.
| That does not seem to be good security.
| anonymfus wrote:
| NT kernel drivers are Portable Executables, and kernel
| does such checks, displaying BSOD with stop code
| 0xC0000221 STATUS_IMAGE_CHECKSUM_MISMATCH if something
| went wrong.
|
| https://learn.microsoft.com/en-us/windows-
| hardware/drivers/d...
| chatmasta wrote:
| The _actual_ bug is not that they pushed out a data file
| with all nulls. It's that their kernel module crashes when
| it reads this file.
|
| I'm not surprised that there is no test pipeline for new
| data files. Those aren't even really "build artifacts." The
| software assumes they're just data.
|
| But I am surprised that the kernel module was deployed with
| a bug that crashed on a data file with all nulls.
|
| (In fact, it's so surprising, that I wonder if there is a
| known failing test in the codebase that somebody marked
| "skip" and then someone else decided to prove a point...)
|
| Btw: is that bug in the kernel module even fixed? Or did
| they just delete the data file filled with nulls?
| hansvm wrote:
| > Btw: is that bug in the kernel module even fixed? Or
| did they just delete the data file filled with nulls?
|
| Is that a real question? They definitely didn't do
| anything more than delete the file, perhaps just rename
| it.
| SoftTalker wrote:
| The instructions that my employer emailed were:
|
| 1. Start Windows in Safe Mode or the Windows Recovery
|    Environment (Windows 11 option).
| 2. Navigate to the C:\Windows\System32\drivers\CrowdStrike
|    directory.
| 3. Locate the file matching C-00000291*.sys and delete it.
| 4. Restart your device normally.
| chrisjj wrote:
| > it could certainly be the case that there was a "real" file
| that worked and the bug was in the "upload verified artifact
| to CDN code" or something
|
| I.e. only one link in the chain wasn't tested.
|
| Sorry, but that will not do.
|
| > We don't have the answers, but I'm not in a rush to assume
| that they don't test anything they put out at all on Windows.
|
| The parent post did not suggest they don't test anything. It
| suggested they did not test the whole chain.
| martinky24 wrote:
| From the parent comment:
|
| > it's insane to me that this size and criticality of a
| company doesn't have a staging or even a development test
| server that tests all of the possible target images that
| they claim to support
|
| I know nothing about Crowdstrike, but I can guarantee that
| "they need to test target images that they claim to
| support" isn't what went wrong here. The implication that
| they don't test against Windows is so incredible, it's
| hard to take the poster of that comment seriously.
| StressedDev wrote:
| Thank you for pointing this out. Whenever I read articles
| about security, or reliability failures, it seems like
| the majority of the commenters assume that the person or
| organization which made the mistake is a bunch of bozos.
|
| The fact is mistakes happen (even huge ones), and the
| best thing to do is learn from the mistakes. The other
| thing people seem to forget is they are probably doing a
| lot of the same things which got CrowdStrike into
| trouble.
|
| If I had to guess, one problem may be that CrowdStrike's
| Windows code did not validate the data it received from
| the update process. Unfortunately, this is very common.
| The lesson is to validate any data received from the
| network, from an update process, received as user input,
| etc. If the data is not valid, reject it.
|
| Note I bet at least 50% of the software engineers
| commenting in this thread do not regularly validate
| untrusted data.
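To make "validate it, and reject it if it is not valid" concrete, here is a minimal sketch for a binary data file. The header layout, the `DEFS` magic, and the version set are invented for the example and do not describe any real product's format.

```python
import struct
import zlib
from dataclasses import dataclass

HEADER = struct.Struct("<4sHII")  # magic, version, payload length, CRC32
MAGIC = b"DEFS"  # hypothetical magic value
SUPPORTED_VERSIONS = {1, 2}  # hypothetical


@dataclass(frozen=True)
class DefinitionFile:
    version: int
    payload: bytes


def load_definition_file(blob: bytes) -> DefinitionFile:
    """Validate every header field before trusting the payload."""
    if len(blob) < HEADER.size:
        raise ValueError("file shorter than header")
    magic, version, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported version {version}")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length field does not match payload size")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return DefinitionFile(version, payload)
```

A file full of null bytes fails the very first field check here. Note that a CRC only catches accidental corruption; it says nothing about who produced the file.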
| 0xcafecafe wrote:
| They could even have done slow rollouts. Roll it out to a
| geographical region and wait an hour or so before deploying
| elsewhere.
| xyst wrote:
| Or test in local environments first. Slow rollouts like this
| tend to make deployments very very painful.
| koliber wrote:
| Slow rollouts can be quite quick. We used to do 3-day
| rollouts. Day one was a tiny fraction. Day two was about
| 20%. Day three was a full rollout.
|
| It was ages ago, but from what I remember, the first day
| rollout did occasionally catch issues. It only affected a
| small number of users and the risk was within the tolerance
| window.
|
| We also tested locally before the first rollout.
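One common way to implement percentage-based stages like the ones described above is to hash each host into a stable bucket. This is a sketch under assumptions: the salt string and the 1% figure for "a tiny fraction" are made up, and the comment does not say how its rollout actually selected users.

```python
import hashlib

# Hypothetical schedule: day number -> fraction of the fleet enabled.
SCHEDULE = {1: 0.01, 2: 0.20, 3: 1.0}


def rollout_bucket(host_id: str, salt: str = "example-update") -> float:
    """Map a host to a stable pseudo-random position in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{host_id}".encode()).digest()
    # 32 bits divide exactly by 2**32, so the result is always below 1.0.
    return int.from_bytes(digest[:4], "big") / 2**32


def should_receive(host_id: str, fraction: float) -> bool:
    """True if the host falls inside the currently enabled fraction."""
    return rollout_bucket(host_id) < fraction
```

Because a host's bucket never changes for a given salt, every host enabled on day one is still enabled on day two, and raising the fraction only ever adds hosts.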
| rplnt wrote:
| I don't know about this particular update, but when I
| used to work for an AV vendor we did like 4 "data"
| updates a day. It is/was about being quick a lot of the
| time, you can't stage those over 3 days. Program updates
| are different, drivers of this level were very different
| (Microsoft had to sign those, among many things).
|
| Not that it excuses anything, just that this probably
| wasn't treated as an update at all.
| daseiner1 wrote:
| You say _even_ (emphasis mine). Is this not industry
| standard?
| saati wrote:
| In theory CrowdStrike protects you from threats, leaving
| regions unprotected for an hour would be an issue.
| Thaxll wrote:
| Not really, even security updates are not needed by the
| minute. Do you think Microsoft rolls out worldwide updates
| to everyone at once?
| notabee wrote:
| Without delving into any kind of specific conspiratorial
| thinking, I think people should also include the possibility
| that this was malicious. It's much more likely to be
| incompetence and hubris, but ever since I found out that this
| is basically an authorized rootkit, I've been concerned about
| what happens if another Solarwinds incident occurs with
| Crowdstrike or another such tool. And either way, we have the
| answer to that question now: it has extreme consequences. We
| really need to end this blind checkbox compliance culture and
| start doing real security.
| sonotathrowaway wrote:
| That's not even getting into the fuckups that must have
| happened to allow a bad patch to get rolled out everywhere all
| at once.
| carterschonwald wrote:
| The strange thing is that when I interviewed there years ago
| with the team that owns the language that runs in the kernel,
| they said their CI has 20k or 40k machine/OS
| combinations/configurations. Surely some of them were vanilla
| Windows!
| dboreham wrote:
| They used synthetic test data in CI that doesn't consist of
| zeros.
| dlisboa wrote:
| Fuzz testing would've saved the day here.
| azemetre wrote:
| I'm sure some team had it in their backlog for years.
| 0x6c6f6c wrote:
| Oh yeah, FEAT#927261? Would love to see that ticket go
| out
| queuebert wrote:
| That team was probably laid off because they weren't
| shipping product fast enough.
| russdill wrote:
| You can have all the CI, staging, test, etc. If some bug after
| that process nulls the file, the rest doesn't matter
| Jtsummers wrote:
| If a garbage file is pushed out, the program could have
| handled it by ignoring it. In this case, it did not and now
| we're (the collective IT industry) dealing with the
| consequences of one company that can't be bothered to
| validate its input (they aren't the only ones, but this is a
| particularly catastrophic demonstration of the importance of
| input validation).
| russdill wrote:
| I'll agree that this appears to have been preventable.
| Whatever goes through CI should have a hash, deployment
| should validate that hash, and the deployment system itself
| should be rigorously tested to ensure it breaks properly if
| the hash mismatches at some point in the process
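A minimal sketch of the hash check suggested above: CI records a SHA-256 of the tested artifact, and the deployment step refuses to publish anything that does not match. The `deploy` and `publish` names are placeholders for whatever the real pipeline uses.

```python
import hashlib
import hmac
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest that CI would record for an artifact."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> None:
    """Raise ValueError unless data matches the hash recorded by CI."""
    actual = sha256_hex(data)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(
            f"hash mismatch: expected {expected_sha256}, got {actual}"
        )


def deploy(
    data: bytes, expected_sha256: str, publish: Callable[[bytes], None]
) -> None:
    """Publish the artifact only if it is byte-identical to what CI tested."""
    verify_artifact(data, expected_sha256)
    publish(data)
```

Testing the deployment path itself then means feeding it a deliberately corrupted copy, for example the same number of null bytes, and asserting that `publish` is never called. A bare hash only proves the bytes match what CI saw; authenticity would need a signature on top.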
| fabian2k wrote:
| Those signature files should have a checksum, or even a
| digital signature. I mean even if it doesn't crash the entire
| computer, a flipped bit in there could still turn the entire
| thing against a harmless component of the system and lead to
| the same result.
| HL33tibCe7 wrote:
| What happens when your mechanism for checksumming doesn't
| work? What happens when your mechanism for installing after
| the checksum is validated doesn't work?
|
| It's just too early to tell what happened here.
|
| The likelihood is that it _was_ negligence. But we need a
| proper post-mortem to be able to determine one way or
| another.
| LorenPechtel wrote:
| Yup. I had quite a battle with some sort of system bug (never
| fully traced) where I wrote valid data but what ended up on
| disk was all zero. It appeared to involve corrupted packets
| being accepted as valid.
|
| It doesn't matter how much you test if something down the
| line zeroes out your stuff.
| Cerium wrote:
| What sort of sane system modifies the build output after
| testing?
|
| Our release process is more like: build and package, sign
| package, run CI tests on signed package, run manual tests on
| signed package, release signed package. The deployment
| process should check those signatures. A test process should
| by design be able to detect any copy errors between test and
| release in a safe way.
| hnlmorg wrote:
| It seems unlikely that a file entirely full of null characters
| was the output of any automated build pipeline. So I'd wager
| something got built, passed the CI tests, then the system broke
| at some point after that when the file was copied ready for
| deployment.
|
| But at this stage, all we are doing is speculating.
| dagaci wrote:
| /* Acceptance criteria #1: do not allow machine to boot if
| invalid data signatures are present, this could indicate a
| compromised system. Booting could cause president's diary to
| transmit to rival 'Country' of the week */
|
| if(dataFileIsNotValid) { throw FatalKernelException("All your
| base are compromised"); }
|
| EDIT+ Explanation:
|
| With hindsight not booting may be exactly the right thing to do
| since a bad datafile would indicate a compromised distribution/
| network.
|
| The machines should not fully boot until file with valid
| signature is downloaded.*
| arp242 wrote:
| > Like it's insane to me that this size and criticality of a
| company doesn't have a staging or even a development test
| server that tests all of the possible target images that they
| claim to support.
|
| Who is saying they don't have that? Who is saying it didn't
| pass all of that?
|
| You're making tons of assumptions here.
| martinky24 wrote:
| Yeah... the comment above reads like someone who has read a
| lot of books on CI deployment, but has zero experience in a
| real world environment actually doing it. Quick to throw
| stones with absolutely no understanding of any of the nuances
| involved.
| chrisjj wrote:
| So let's hear the "nuances" that excuse this.
| cweld510 wrote:
| It's not a matter of excusing or not excusing it.
| Incidents like this one happen for a reason, though, and
| the real solution is almost never "just do better."
|
| Presumably crowdstrike employs some smart engineers. I
| think it's reasonable to assume that those engineers know
| what CI/CD is, they understand its utility, and they've
| used it in the past, hopefully even at Crowdstrike.
| Assuming that this is the case, then how does a bug like
| this make it into production? Why aren't they doing the
| things that would have prevented this? If they cut
| corners, why? It's not useful or productive to throw
| around accusations or demands for specific improvements
| without answering questions like these.
| arp242 wrote:
| I am not defending or excusing anything. I am saying
| there is not enough information to make a judgement one
| way or the other. Right now, we have almost zero
| technical details.
|
| Call me old-fashioned and boring, but I'd like to have
| some basic facts about the situation first. After this I
| decide who does and doesn't deserve a bollocking.
| chrisjj wrote:
| I think we do have enough info to judge e.g. "This should
| not have passed a competent C/I pipeline for a system in
| the critical path."
|
| That info includes that the faulty file consisted
| entirely of zeros.
| arp242 wrote:
| > That info includes that the faulty file consisted
| entirely of zeros.
|
| Even that is not certain. Some people are reporting that
| this isn't the case and that the all-zeroed file may be a
| "quick hack" to send out a no-op.
|
| So no, we have very little info.
| jacobr1 wrote:
| Not an excuse - they should be testing for this exact
| thing - but Crowdstrike (and many similar security tools)
| have a separation between "signature updates" and
| "agent/code" updates. My (limited) reading of this
| situation is that this was an update of their "data", not
| the application. Now apparently the dynamic update
| included operating code, not just something the
| equivalent of a yaml file or whatever, but I can see how
| different kinds of changes like this go through different
| pipelines. Of course, that is all the more reason to
| ensure you have integration coverage.
| AndrewKemendo wrote:
| There is no nuance needed - this is a giant corporation
| that sells kernel layer intermediation at global scale. You
| better be spending billions on bulletproof deployment
| automation because *waves hands around in the air pointing
| at what's happening just like with solarwinds*
|
| Bottom line this was avoidable and negligent
|
| For the record I owned global infrastructure as CTO for the
| USAF Air Operations weapons system - one of the largest
| multi-classification networked IT systems ever created for
| the DoD - even moreso during a multi-region refactor as a
| HQE hire into the AF
|
| So I don't have any patience for millionaires not putting
| the work in when it's critical infrastructure
|
| People need to do better and we need accountability for
| people making bad decisions for money saving
| arp242 wrote:
| Almost everything that goes wrong in the world is
| avoidable one way or the other. Simply stating "it was
| avoidable" as an axiom is simplistic to the point of
| silliness.
|
| Lots of very smart people have been hard at work to
| prevent airplanes from crashing for many decades now, and
| planes still crash for all sorts of reasons, usually
| considered "avoidable" in hindsight.
|
| Nothing is "bulletproof"; this is a meaningless buzzword
| with no content. The world is too complex for this.
| HL33tibCe7 wrote:
| > You better be spending billions on bulletproof
| deployment automation
|
| There is no such thing.
| JKCalhoun wrote:
| To be sure. But the fact is the release broke.
|
| I'm not sure: is having test servers that it passed any
| better than none at all?
| martinky24 wrote:
| Yes, yes it is. Because there's tons more breakages that
| have likely been caught.
|
| One uncaught downstream failure doesn't invalidate the
| effort into all the previously caught failures.
| strken wrote:
| It is absolutely better to catch some errors than none.
|
| In this case it gives me vibes of something going wrong
| _after_ the CI pipeline, during the rollout. Maybe they
| needed advice a bit more specific than "just use a staging
| environment bro", like "use checksums to verify a release
| was correctly applied before cutting over to the new
| definitions" and "do staged rollouts, and ideally release
| to some internal canary servers first".
| exe34 wrote:
| I don't understand why you wouldn't do staged roll outs
| at this scale. even a few hours delay might have been
| enough to stop the release from going global.
| martinky24 wrote:
| "Have these idiots even heard of CI/CD???" strangely
| seems to be a common condescending comment in this
| thread.
|
| I honestly thought HN was slightly higher quality than
| most of the comments here. I am proven wrong.
| kristjansson wrote:
| Big threads draw a lot of people; we regress toward the
| mean
| StressedDev wrote:
| Agreed - The worst part is most of the people making
| these unhelpful comments are probably doing the same
| sorts of things which caused this outage.
| chuckadams wrote:
| > I honestly thought HN was slightly higher quality
|
| HN reminds me of nothing so much as Slashdot in the early
| 2000's, for both good and ill. Fewer stupid memes about
| Beowulf Clusters and Natalie Portman tho.
| chuckadams wrote:
| They almost certainly have such a process, but it got
| bypassed by accident, probably got put into a "minor
| updates" channel (you don't run your model checker every
| time you release a new signature file after all).
| Surprise, business processes have bugs too.
|
| But naw, must be every random commentator on HN knows how
| to run the company better.
| chatmasta wrote:
| The release didn't break. A data file containing nulls was
| downloaded by a buggy kernel module that crashed when
| reading the file.
|
| For all we know there is a test case that failed and they
| decided to push the module anyway ("it's not like anyone is
| gonna upload a file of all nulls").
|
| Btw: where are these files sourced from? Could a malicious
| Crowdstrike customer trick the system into generating this
| data file, by e.g. reporting it saw malware with these
| (null) signatures?
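|
| A rough sketch of the kind of check a reader of such a data file
| could run before parsing it. The 4-byte magic header is a made-up
| assumption for illustration, not the real file format.

```python
MAGIC = b"DEMO"  # hypothetical header, not any real vendor format


def validate_blob(blob: bytes) -> str:
    """Classify an update blob before any parser touches it."""
    if not blob:
        return "empty"
    if blob.count(0) == len(blob):
        return "all-nulls"  # every byte is 0x00
    if not blob.startswith(MAGIC):
        return "bad-magic"
    return "ok"
```

| Anything other than "ok" would be rejected and the previous file
| kept, rather than handed to the parser.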
| leptons wrote:
| A lot of the software industry focuses on strong types,
| testing of all kinds, linting, and plenty of other
| sideshows that make programmers feel like they're in
| control, but these things only account for the problems you
| can test for and the systems you control. So what if a
| function gets a null instead of a float? It shouldn't crash
| half the tech-connected world. Software resilience is kind
| of lacking in favor of trusting that strong types and tests
| will catch most bugs, and that's good enough?
| ikiris wrote:
| Dude, the fact that it breaks directly.
|
| You sound like the guy that a few years ago tried to argue
| (the company in question) tested os code that didn't include
| any drivers for their gear's local storage. It's obvious it
| wasn't to anyone competent.
| dheera wrote:
| I don't know if people on Microsoft ecosystems even know what
| CI pipelines are.
|
| Linux and Unix ecosystems in general work by people thoroughly
| testing and taking responsibility for their work.
|
| Windows ecosystems work by blame passing. Blame Ron, the IT
| guy. Blame Windows Update. Blame Microsoft. That's how stuff
| works.
|
| It has always worked this way.
|
| But also, all the _good_ devs got offered 3X the salary at
| Google, Meta, and Apple. Have you ever applied for a job at
| CrowdStrike? No? That's why they suck.
|
| * A disproportionately large number of Windows IT guys are
| named Ron, in my experience.
| kabdib wrote:
| That's a pretty broad brush.
| miki123211 wrote:
| Keep in mind that this was probably a data file, not
| necessarily a code file.
|
| It's possible that they run tests on new commits, but not when
| some other, external, non-git system pushes out new data.
|
| Team A thinks that "obviously the driver developers are going
| to write it defensively and protect it against malformed data",
| team B thinks "obviously all this data comes from us, so we
| never have to worry about it being malformed"
|
| I don't have any non-public info about what actually happened,
| but something along these lines seems to be the most likely
| hypothesis to me.
|
| Edit: Now what would have helped here is a "staged rollout"
| process with some telemetry. Push the update to 0.01% of your
| users and solicit acknowledgments after 15 minutes. If the vast
| majority of systems are still alive and haven't been restarted,
| keep increasing the threshold. If, at any point, too many of
| the updated systems stop responding or indicate a failure,
| immediately stop the rollout, page your on-call engineers and
| give them a one-click process to completely roll the update
| back, even for already-updated clients.
|
| This is exactly the kind of issue that non-invasive, completely
| anonymous, opt-out telemetry would have solved.
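|
| The staged rollout described above can be sketched roughly as
| follows. The stage fractions, the failure threshold, and the
| push_and_check callback (push to one host, return True if it
| acknowledges) are all illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    push_and_check: Callable[[str], bool],
    stages: Sequence[float] = (0.0001, 0.01, 0.1, 1.0),
    max_failure_rate: float = 0.01,
) -> Tuple[bool, List[str]]:
    """Push to growing fractions of hosts; halt when too many fail.

    Returns (completed, updated_hosts) so the caller can roll back
    every host in updated_hosts when completed is False.
    """
    updated: List[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated by this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not push_and_check(host))
        updated.extend(batch)
        if failures / len(batch) > max_failure_rate:
            return False, updated  # stop here and page someone
    return True, updated
```

| With 1000 hosts and an update that kills every machine, this stops
| after a single host instead of all of them.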
| adzm wrote:
| This was a .dll in all but name fwiw.
| ar_lan wrote:
| > tests all of the possible target images that they claim to
| support.
|
| Or even at the very least the most popular OS that they
| support. I'm genuinely imagining right now that for this
| component, the entirety of the company does not have a single
| Windows machine they run tests on.
| bryanlarsen wrote:
| Segue: What the heck is it about Windows files and null characters?
| I've been almost exclusively dealing with POSIX file systems for
| the last 30 years, but I'm currently shipping a cross-platform
| app and a lot of my Windows users are encountering corrupted
| files which exhibit a bunch of NULs in random places in the file.
| I've added ugly hacks to deal with them but it'd be nice to get
| down to root causes. Is there a handy list of things that are
| safe on POSIX but not on Windows so I can figure out what I'm
| doing wrong?
|
| I'm at the stage where I'm thinking "%$#@ this, I'm never going
| to write to the Windows file system again, I'm just going to
| create an SQLite DB and write to that instead". Will that fix my
| problems?
| nradov wrote:
| The Windows NTFS is safe and reliable. It doesn't corrupt
| files. You have probably misunderstood the problem.
| bryanlarsen wrote:
| However I'm using it, it certainly isn't. It's certainly quite
| likely the problem is me, not Windows.
|
| Most of the files that are getting corrupted are being
| written to in an append-only fashion, which is generally one
| of the mechanisms for writing to files to avoid corruption,
| at least on POSIX.
| omoikane wrote:
| I have observed a file filled with NULs that was caused by a power
| loss in the middle of a write -- my UPS alerted me that
| utility power is gone, I tried to shutdown cleanly, but the
| battery died before a clean shutdown completed. This was NTFS
| on a HDD and not a SSD.
|
| I am not saying it happens often, but it does happen once in
| a while.
| bryanlarsen wrote:
| Yes, corruption does appear to be correlated with power
| cycles.
| ale42 wrote:
| Had the same on journaled ext4 on Linux. Lots of NULL bytes
| in the middle of the syslog because of unclean shutdown.
| dist-epoch wrote:
| NTFS guarantees file-system metadata integrity, not file
| data integrity. Subtle but important difference.
|
| The file was corrupted, but the file-system remained
| consistent.
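|
| Since the file system only promises consistent metadata, an
| append-only log can protect itself. A minimal sketch, assuming a
| made-up record framing (length plus CRC-32) and that empty payloads
| are disallowed so a zero-filled region never parses as a record.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC-32 of payload


def durable_append(path: str, payload: bytes) -> None:
    """Append one framed record and ask the OS to flush it to the device."""
    if not payload:
        raise ValueError("empty payloads are not allowed in this framing")
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    with open(path, "ab") as f:
        f.write(record)
        f.flush()             # drain Python's userspace buffer
        os.fsync(f.fileno())  # request that the OS write through its cache


def read_valid_records(path: str) -> list:
    """Return payloads up to the first torn or zero-filled record."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn write or NUL padding: stop trusting the file here
        records.append(payload)
        offset = start + length
    return records
```

| fsync is only a request; a drive that acknowledges writes it has not
| persisted can still lose data, which is why the reader verifies
| every record instead of trusting the tail of the file.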
| tatersolid wrote:
| Returning all zeros randomly is one of the failure modes of
| crappy consumer SSDs with buggy controllers. Especially those
| found in cheap laptops and on Amazon. If it's a fully
| counterfeit drive it might even be maliciously reporting a size
| larger than it has flash chips to support. It will accept
| writes without error but return zeros for sectors that are
| "fake". This can appear random due to wear-leveling in the
| controller.
| alexisread wrote:
| I'd guess yes, and you get the SQL goodness to boot :)
|
| Sounds like you have an encoding issue somewhere, Windows has
| its own charset - Windows-1252, so I'd vet all your libs that
| touch the file (including eg. .Net libs etc). If one of them
| defaults to that encoding you may get it either mislabelling
| the file encoding, or adding in null after each append etc.
|
| SQLite is tested cross-platform so 100% the file will be cross-
| platform compatible.
| dist-epoch wrote:
| A lot of your users encountering file corruption on Windows
| either means that somehow your users are much more likely to
| have broken hardware, or more realistically that you have a bug
| in your code/libraries.
| neffy wrote:
| There's a claim over on Mastodon from Kevin Beaumont that the
| file is different on every customer he's received the file from.
|
| https://cyberplace.social/@GossiTheDog/112812454405913406
|
| (scroll down a little)
| drewg123 wrote:
| I thought windows required all kernel modules to be signed..?
| If there are multiple corrupt copies, rather than just some
| test escape, how could they have passed the signature
| verification and been loaded by the kernel?
| dist-epoch wrote:
| This is not even a valid executable.
|
| Most likely it is not loaded as a driver binary, but instead is
| some data file used by the CrowdStrike driver.
| millero wrote:
| Yes, this fits in with what I heard on the grapevine about this
| bug from a friend who knows someone working for Crowdstrike. The
| bug had been sitting there in the kernel driver for years before
| being triggered by this flawed data, which actually was added in
| a post-processing step of the configuration update - after it had
| been tested but before being copied to their update servers for
| clients to obtain.
|
| Apparently, Crowdstrike's test setup was fine for this
| configuration data itself, but they didn't catch it before it was
| sent out in production, as they were testing the wrong thing.
| Hopefully they own up to this, and explain what they're going to
| do to prevent another global-impact process failure, in whatever
| post-mortem writeup they may release.
| finaard wrote:
| You need to be a very special kind of stupid to think
| postprocessing anything after you've tested it is a good idea.
| dgfitz wrote:
| Hmm, I post-process autonomous vehicle logs probably daily.
|
| Why is this stupid? It's pretty useful to see a graph of
| coolant temp vs ambient temp vs motor speed vs roll/pitch.
|
| I must be especially stupid I suppose. Nuts.
| Flockster wrote:
| That is not remotely what was meant..
| dgfitz wrote:
| Perhaps word choice and sentence structure are important
| then.
| heylook wrote:
| What you have just said is one of the most insanely
| idiotic things I have ever heard. At no point in your
| rambling, incoherent response were you even close to
| anything that could be considered a rational thought.
| Everyone in this room is now dumber for having listened
| to it. I award you no points, and may God have mercy on
| your soul.
| jmull wrote:
| I don't think people should restate the basic context of
| the thread for every post... That's a lot of work and
| noise, and probably the same people who ignore the thread
| context would also ignore any context a post provided.
| wri321 wrote:
| This is comparable to modifying the system under test after
| it has been validated and not simply looking at recorded
| data.
| dist-epoch wrote:
| "We need to ship this by Friday. Just add a quick post-
| processing step, and we'll fix it next week properly" - how
| these things tend to happen.
| yard2010 wrote:
| In my first engineering job ever, I worked with this snarky
| boss who was mean to everyone and just said NO every time
| to everything. She also had a red line: NO RELEASES ON THE
| LAST DAY OF THE WEEK. I couldn't understand why. Now, 10
| years later, I understand I just had the best boss ever. I
| miss you, Vik.
| arp242 wrote:
| I still have a 10-year old screenshot from a colleague
| breaking production on a Friday afternoon and posting
| "happy weekend everyone!" just as the errors from
| production started to flood in on the same chat. And he
| really did just fuck off leaving us to mop up the
| hurricane of piss he unleashed.
|
| He was not my favourite colleague.
| arp242 wrote:
| "I heard on the grapevine from a friend who knows someone
| working for Crowdstrike" is perhaps not the most reliable
| source of information, due to the game of telephone effect if
| nothing else.
|
| And post-processing can mean many things. Could be something
| relatively simple such as "testing passed, so let's mark the
| file with a version number and release it".
| MilStdJunkie wrote:
| Holy smokes. I'm no programmer, but I've built out bazillions of
| publishing/conversion/analysis systems, and null purge is pretty
| much the first thing that happens, every time. x00 breaks
| virtually everything just by existing - like, seriously, markup
| with one of these damn things will make the rest of the system
| choke and die as soon as it looks at it. Numpy? Pytorch? XSL?
| Doesn't matter. _cough cough cough GACK_
|
| And my systems are all halfass, and I don't really know what I'm
| doing. I can't imagine actual real professionals letting that
| moulder its way downstream. Maybe their stuff is just way more
| complex and amazing than I can possibly imagine.
| wormlord wrote:
| Not a C programmer, why is 0x00 so bad? It's the string
| terminator character right?
| bagful wrote:
| Indeed, '\0' is the sentinel character for uncounted strings
| in C, and even if your own counted string implementation is
| "null-byte clean", aspects of the underlying system may not
| be (Windows and Unix filesystems forbid embedded null
| characters, for example).
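|
| A small illustration of the counted versus NUL-terminated
| distinction, using Python's ctypes to get a C-style view of the
| same bytes. The helper name open_rejects_nul is made up for this
| example.

```python
import ctypes

data = b"abc\x00def"

# A counted view keeps every byte, including the embedded NUL.
counted_length = len(data)

# A C-style view stops at the first NUL: create_string_buffer copies the
# bytes into a char array, and .value reads it as a NUL-terminated string.
buf = ctypes.create_string_buffer(data)
c_style = buf.value


def open_rejects_nul(name: str) -> bool:
    """Return True if open() refuses a file name with an embedded NUL."""
    try:
        open(name).close()
    except ValueError:
        return True   # Python rejects the name before calling the OS
    except OSError:
        return False  # name was accepted, the file just could not be opened
    return False
```

| Here counted_length is 7 while c_style is only b"abc"; the bytes
| after the NUL are still in the buffer (see buf.raw) but invisible to
| anything that treats it as a C string.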
| tedunangst wrote:
| It's a byte like any other. You're more likely to see big
| files full of 0x0 than 0x1, but it's really not so different.
| hawski wrote:
| Binary files are full of null bytes; it is one of the main
| criteria of binary file recognition. Large swaths of null
| bytes are also common, common enough we have sparse files -
| files with holes in them. Those holes are all zeroes, but are
| not allocated in the file system. For an easy example think
| about a disk image.
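|
| A quick sketch of creating a file with a hole. Whether the hole is
| really left unallocated depends on the file system, and st_blocks
| is only reported on POSIX-like platforms, so treat the allocation
| figure as best-effort. Function names are illustrative.

```python
import os


def make_sparse(path: str, size: int) -> None:
    """Create a file of the given size by writing only its last byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)  # skip ahead without writing anything
        f.write(b"\x01")  # the only byte actually written


def describe(path: str):
    """Return (apparent size, allocated bytes or None if unknown)."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)  # 512-byte units on POSIX
    return st.st_size, (blocks * 512 if blocks is not None else None)
```

| Reading the skipped region returns zero bytes either way, which is
| why large runs of NULs in a file are not unusual in themselves.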
| j-wags wrote:
| It's possible that these aren't the original file contents, but
| rather the result of a manual attempt to stop the bleeding.
|
| Someone may have hoped that overwriting the bad file with an
| all-0 file of the correct size would make the update benign.
|
| Or following the "QA was bypassed because there was a critical
| vulnerability" hypothesis, stopping distribution of the real
| patch may be an attempt to reduce access to the real data and
| slow reverse-engineering of the vulnerability.
| 0cf8612b2e1e wrote:
| On the plus side of this disaster, I am holding out some pico-
| sized hope that maybe organizations will rethink kernel level
| access. No, random gaming company, you are not good enough to
| write kernel level anti cheat software.
| majormajor wrote:
| I can't imagine gaming software being affected at all, unless
| MS does a ton of cracking down (and would still probably give
| hooks for gaming since they have gaming companies in their
| umbrella).
|
| No corporate org is gonna bat an eye at Riot's anti-cheat
| practices, because they aren't installing LoL on their line of
| business machines anyway.
| InitialLastName wrote:
| Right, MS just paid $75e9 for a company whose main products
| are competitive multiplayer games. They are never going to be
| incentivized to compromise that sector by limiting what anti-
| cheat they can do.
| Y_Y wrote:
| That's 7.5e10 USD in SI.
| InitialLastName wrote:
| Engineering notation used for prefix convenience.
| minetest2048 wrote:
| Until the malware bring their own compromised signed anti
| cheat driver on their own, like what happened with Genshin
| Impact anti cheat mhyprot2
| tgsovlerkhgsel wrote:
| > because they aren't installing LoL on their line of
| business machines anyway
|
| But if their business is incompatible with strict software
| whitelisting, their employees might...
| pvillano wrote:
| imo anti-cheat should mostly be server-side behavior based
| gruez wrote:
| How are you going to catch wallhackers that aren't blatantly
| obvious?
| JasonSage wrote:
| You may not, and that's ok.
| chowells wrote:
| It's not ok for people playing those games. They'll quit
| playing that game and go to one with invasive client-side
| anti-cheat instead.
|
| The incentives and priorities are _very_ different for
| people who want to play fair games than they are for
| people who want to maximize their own freedom.
| kjkjadksj wrote:
| This is a solved issue already. Vote kicks or server
| admin intervention. Aimbotting was never an issue for the
| old primitive fps games I would play because admins could
| spectate and see you are aimbotting.
|
| A modern game need only telemetry that captures what a
| spectating admin picks up, rather than active
| surveillance.
|
| Hackers are only a problem when servers are left
| unmoderated and players can't vote kick.
| nemothekid wrote:
| You can't have vote kicks/server admins/hosted servers
| with competitive ranked ladders. If your solution is
| "don't have competitive ranked ladders" then you are just
| telling the majority of people who even care about anti-
| cheat to just not play their preferred game mode.
| chowells wrote:
| That stopped being a solution when winning online started
| mattering. There are real money prizes for online game
| tournaments. Weekly events can have hundreds of dollars
| in their prize pools. Big events can have thousands.
|
| Suddenly vote kicking had to go, because it was abused.
| Not in the tournaments themselves, but in open ranked
| play which serves as qualifiers. An active game can rack
| up thousands of hours of gameplay per day, far beyond the
| ability of competent admins to validate. Especially
| because cheating is often subtle. An expert can spend
| more than real time looking for subtle patterns that
| automated tools haven't been built to detect.
|
| Games aren't between you and your 25 buddies for bragging
| rights anymore. They're between you and 50k other active
| players for cash prizes. The world has changed. Anti-
| cheat technology _followed_ that change.
| JasonSage wrote:
| I play one of those games that doesn't strongly enforce
| anti-cheating, and I agree with you that it's a huge
| detraction compared to games with strong anti-cheat.
|
| But I strongly disagree about the use of invasive client-
| side anti-cheat. Server-side anti-cheat can reduce the
| number of cheaters to an acceptably low level.
|
| See for example how lichess detects and aids in detection
| of cheaters: https://github.com/clarkerubber/irwin
|
| And chess is a game where I feel like it would be
| relatively hard to detect cheating. An algorithm looking
| at games with actors moving in 3D space and responding to
| relative positions and actions of multiple other actors
| should have a great many more ways to detect cheating
| over the course of many games.
| JasonSage wrote:
| And frankly, I think the incentive structure has nothing
| to do with whether tournaments are happening with money
| on the line, and a great deal more whether the company
| has the cash and nothing better to do.
|
| Anti-cheat beyond a very basic level is nothing to these
| companies except a funnel optimization to extract the
| maximum lifetime value out of the player base. Only the
| most successful games will ever have the money or reach
| the technical capability to support this. Nobody making
| these decisions is doing it for player welfare.
| mrguyorama wrote:
| The only reason wallhacking is possible in the first place
| is a server sending a client information on a competitor
| that the client should not know about.
|
| IE the server sends locations and details about all players
| to your client, even if you are in the spawn room and can't
| see anyone else and your client has to hide those details
| from you. It is then trivial to just pull those details out
| of memory.
|
| The solution forever has been to just not send clients
| information they shouldn't have. My copy of CS:GO should
| not know about a terrorist on the other side of the map.
| The code to evaluate that literally already exists, since
| the client will answer that question when it goes to render
| visuals and sound. They just choose to not do that testing
| server side.
|
| Aimbotting however is probably impossible to stop. Your
| client has to know where the model for an enemy is to
| render it, so you know where the hitbox roughly should be,
| and most games send your client the hitbox info directly so
| it can do predict whether you hit them. I don't think you
| can do it behaviorally either.
| snailmailman wrote:
| To some extent though- the games _do_ need information
| about players that are behind walls. In CSGO/CS2, even
| if you can't see the player you can hear their footsteps
| or them reloading, etc. the sound is _very_ positional.
| Plus, you can shoot through some thin walls at these
| players. Even if they can't be _seen_.
|
| I don't believe server side anti cheat can truly be
| effective against some cheats. But also Vanguard is trash
| and makes my computer bluescreen. I've stopped playing
| league entirely because of it.
| 0cf8612b2e1e wrote:
| Nit, but surely hit detection happens on the server?
| Shooting wildly should always register a hit, regardless
| of what the client knows.
| pohuing wrote:
| You don't happen to have used some means to install win
| 11 on an unsupported device have you? People bypassing
| the windows install requirements and then vanguard making
| false assumptions have been a source of issues.
| anonymoushn wrote:
| You may have players complain that when they walk around
| a corner, the enemy who they should be able to see
| immediately is briefly invisible.
| andy81 wrote:
| Aside from aimbots, there's plenty of abusable legitimate
| information exposed to the client.
|
| E.g. For CS:GO, the volume of footsteps and gunshots vary
| by distance so you could use them to triangulate an
| enemy's position.
| bsder wrote:
| > The only reason wallhacking is possible in the first
| place is a server sending a client information on a
| competitor that the client should not know about.
|
| Some information is required to cover the network and
| server delays.
|
| The client predicts what things should look like and then
| corrects to what they actually are if there is a
| discrepancy with the server. You _cannot_ get around this
| short of going back to in-person LAN games.
| SigmundA wrote:
| So the server must render the 3d world from each players
| perspective to do these tests? Sounds ridiculously
| expensive.
| Ukv wrote:
| > So the server must render the 3d world from each
| players perspective to do these tests?
|
| Just some raycasts through the geometry should be
| sufficient, which the server is already doing (albeit on
| likely-simplified collision meshes) constantly.
|
| If you really do have a scenario where occlusion
| noticeably depends on more of the rendering pipeline (a
| window that switches between opaque and transparent based
| on a GPU shader?) you could just treat it as always
| transparent for occlusion checking and accept the tiny
| loss that wallhackers will be able to see through it, or
| add code to simulate that server-side and change the
| occlusion geometry accordingly.
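|
| A toy 2D version of the raycast idea, assuming walls are line
| segments and players are points. It only detects proper crossings
| (touching or collinear cases are ignored) and adds no margin for
| latency or peeking, both of which a real server would need.

```python
def _orient(a, b, c) -> float:
    """Signed area test: >0 if c is left of a->b, <0 if right, 0 if on it."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1, p2, q1, q2) -> bool:
    """True if segment p1-p2 properly crosses segment q1-q2."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def visible_enemies(viewer, enemies, walls):
    """Return names of enemies whose sight line to viewer hits no wall."""
    return [
        name
        for name, pos in enemies.items()
        if not any(segments_cross(viewer, pos, a, b) for a, b in walls)
    ]
```

| The server would then send position updates only for the names
| this returns, so a wallhack has nothing hidden to reveal.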
| kjkjadksj wrote:
| Of course you can. You can measure telemetry like where
| the aimpoint is on a hitbox. Is it centered or at least
| more accurate than your global population? Hacker, ban.
| How about time to shoot after hitting target? Are they
| shooting instantly, is the delay truly random? If not
| then banned. You can effectively force the hacking tools
| to only be about as good as a human player, at which
| point it hardly matters whether you have hackers or not.
|
| Of course, no one handles hacking like this because it's
| cheaper to just ship fast and early and never maintain
| your servers. Not even valve cares about their games and
| they are the most benevolent company in the industry.
| nemothekid wrote:
| Valve does not have kernel level anticheat. Faceit does.
| Most high ranked players prefer to play on Faceit because
| of the amount of cheaters in normal CS2 matchmaking.
| Ukv wrote:
| Minimize the possible advantage by not sending the client
| other players' positions until absolutely necessary (either
| the client can see the other player, or there's a movement
| the client could make that would reveal the other player
| before receiving the next packet), and eliminate the
| cheaters you can with server-side behavior analysis and
| regular less-invasive client-side anticheat.
|
| Ultimately even games with kernel anticheat have cheating
| issues; at some point you have to accept that you cannot
| stop 100.0% of cheaters. The solution to someone making an
| aimbot using a physically separate device (reading monitor
| output, giving mouse input) cannot be to require keys to
| the player's house.
| lutoma wrote:
| > not sending the client other players' positions until
| absolutely necessary (either the client can see the other
| player, or there's a movement the client could make that
| would reveal the other player before receiving the next
| packet)
|
| I think the problem with this is sounds like footsteps or
| weapons being fired that need to be positional.
|
| Which makes me wonder if you could get away with mixing
| these sounds server-side and then streaming them to the
| client to avoid sending positions. Probably infeasible in
| practice due to latency and game server performance, but
| fun to think about.
| Ukv wrote:
| To whatever extent the sound is intended to only give a
| general direction, I'd say quantize the angle and volume
| of the sound before it's sent such that cheaters also
| only get that same vague direction. Obviously don't send
| inaudible/essentially-inaudible sounds to the client at
| all.
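|
| One way the quantization could look, as a sketch only. The linear
| falloff, the 8 direction sectors, the 4 volume levels, and the
| 50-unit audible range are all assumed values for illustration.

```python
import math


def quantize_sound(listener, source, loudness,
                   sectors=8, levels=4, max_range=50.0):
    """Return (sector_index, volume_level), or None if inaudible."""
    dx = source[0] - listener[0]
    dy = source[1] - listener[1]
    distance = math.hypot(dx, dy)
    # Assumed linear falloff: full loudness at 0, silent at max_range.
    volume = loudness * max(0.0, 1.0 - distance / max_range)
    level = min(int(volume * levels), levels)  # round down to a coarse step
    if level <= 0:
        return None  # do not send inaudible sounds at all
    angle = math.atan2(dy, dx) % (2 * math.pi)
    # Snap the bearing to the nearest of `sectors` directions.
    sector = int(angle / (2 * math.pi) * sectors + 0.5) % sectors
    return sector, level
```

| Two nearby sources land in the same (sector, level) bucket, so the
| client cannot recover an exact position from a single sound.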
| Workaccount2 wrote:
| They need to just make CPU's, GPU's, and memory modules
| with hardware level anti-cheat. Totally optional
| purchase, but grants you access to very-difficult-to-
| cheat-in servers.
| didntcheck wrote:
| That sort of already exists - I believe a small number of
| games demand that you have Secure Boot enabled, meaning
| you should only have a Microsoft-approved kernel and
| drivers running. And then the anticheat is itself
| probably kernel level, so can see anything in userspace
|
| It may still be possible to get round this by using your
| own machine owner key or using PreLoader/shim [1] to sign
| a hacked Windows kernel
|
| [1] https://wiki.archlinux.org/title/Unified_Extensible_Firmware...
| bpye wrote:
| I guess you've just invented an Xbox/PlayStation.
| Am4TIfIsER0ppos wrote:
| Standalone servers. Run your own then you can ban anyone
| you like, or better still only allow anyone you like.
| cobalt60 wrote:
| Nothing like sourcemodded server! Good old days!
| kjkjadksj wrote:
| Did their hitbox clip through the wall? Yes? Banned. You
| could do it with telemetry.
| Arnavion wrote:
| You're confusing wallhacking with noclipping. Wallhacking
| is being able to see through walls, like drawing an
| outline around all characters that renders with highest
| z-order, or making wall textures transparent.
|
| It does not result in any server-side-detectable
| difference in behavior other than the hacker seemingly
| being more aware of their surroundings than they should,
| which can be hard to determine for sure. Depending on how
| the hack is done, it may not be detectable by the client
| either, eg by intercepting the GPU driver calls to render
| the outlines or switch the wall textures.
| josephcsible wrote:
| Stop thinking about trying to catch wallhackers. Instead,
| make wallhacking impossible. Do that by fixing the server
| to, instead of sending all player positions to everyone,
| only send player positions to clients that they have an
| unobstructed view of.
| frizlab wrote:
| Unless I'm mistaken on macOS at least kernel access is just not
| possible, so at least there's that.
| pityJuke wrote:
| The problem you're fighting is cheat customers who go "random
| kernel-level driver? no problem!"
| hn_throwaway_99 wrote:
| On a related note, I don't think that it's a coincidence that 2
| of the largest tech meltdowns in history (this one and the
| SolarWinds hack from a few years ago) were both the result of
| "security software" run amok. (Also sad that both of these
| companies are based in Austin, which certainly gives Austin's
| tech scene a black eye).
|
| IMO, I think a root cause issue is that the "hacker types" who
| are most likely to want to start security software companies are
| also the least likely to want to implement the "boring" pieces of
| a process-oriented culture. For example, I can't speak so much
| for CrowdStrike, but it came out that SolarWinds had an
| _egregiously_ bad security culture at their company. When the
| root cause comes out about this issue dollars-to-donuts it was
| just a fast and loose deployment process.
| koliber wrote:
| Don't forget heartbleed, a vulnerability in OpenSSL, the
| software that secures pretty much everything.
| NegativeK wrote:
| Alternate hypothesis that's not necessarily mutually exclusive:
| security software tends to need significant and widespread
| access. That means that fuckups and compromises tend to be more
| impactful.
| hn_throwaway_99 wrote:
| 100% agree with that. The thing that baffles me a bit, then,
| is that if you are writing software that _is_ so critical and
| can have such a catastrophic impact when things go wrong,
| that you double and triple check everything you do - what you
| DON'T do is use the same level of care you may use with some
| social media CRUD app (move fast and break things and all
| that...)
|
| To emphasize, I'm really just thinking about the bad
| practices that were reported after the SolarWinds hack
| (password of "solarwinds123" and a bunch of other insider
| reports), so I can't say that totally applies to CrowdStrike,
| but in general I don't feel like these companies that can
| have such a catastrophic impact take appropriate care of
| their responsibilities.
| compacct27 wrote:
| The Austin tech culture is...interesting. I stopped trying to
| find a job here and went remote Bay Area, and talking to tech
| workers in the area gave me the impression it's a mix of
| slacker culture and hype chasing. After moving back here, tech
| talent seems like a game of telephone, and we're several jumps
| past the original.
|
| When I heard CrowdStrike was here, it just kinda made sense
| moandcompany wrote:
| Crowdstrike was originally founded and headquartered in Irvine,
| CA (Southern California). In those days, most of its
| engineering organization was either remote/WFH or in Irvine,
| CA.
|
| As they got larger, they added a Sunnyvale office, and later
| moved the official headquarters to Austin, TX.
|
| They've also been expanding their engineering operations
| overseas which likely includes offshoring in the last few
| years.
| nullify88 wrote:
| They bought out Humio in Aarhus, Denmark. Now Falcon
| Logscale.
| ajsnigrutin wrote:
| Security software needs kernel level access.. if something
| breaks, you get boot loops and crashes.
|
| Most other software doesn't need that low level of access, and
| even if it crashes, it doesn't take the whole system with it,
| and a quick, automated upgrade process is possible.
| rahkiin wrote:
| Security software needs kernel level access.. *on Windows.
| macOS has an Endpoint Security userland extension api
| sharkjacobs wrote:
| This seems like a pretty clear example of the philosophical
| divide between MacOS and Windows.
|
| A good developer with access to the kernel can create
| "better" security software which does less context
| switching and has less performance impact. But a bad
| (incompetent or malicious) developer can do a lot more harm
| with direct access to the kernel.
|
| We see the exact same reasoning with restricting execution
| of JIT-compiled code in iOS.
| cedws wrote:
| > the "hacker types" who are most likely to want to start
| security software companies are also the least likely to want
| to implement the "boring" pieces of a process-oriented culture
|
| I disagree, security companies suffer from "too big to fail"
| syndrome where the money comes easy because they have customers
| who want to check a box. Security is NOT a product you pay for,
| it's a culture that takes active effort and hard work to embed
| from day one. There's no product on the market that can provide
| security, only products to point a finger at when things go
| wrong.
| Andrex wrote:
| The market is crying for some kind of '10s "agile hype"
| equivalent for security evangelism and processes.
| OutOfHere wrote:
| This looks like a test file that got deployed. Perhaps a QA test
| was newly added which ran and overwrote the build. This is all I
| can think of.
| markus_zhang wrote:
| I'm starting to think that the timing (Friday) and the scale as
| well as other things (like this finding) might -- just might --
| point to a bad actor.
|
| We will probably have to wait for CS' own report.
| breadwinner wrote:
| I blame Microsoft. Why? Because they rely on third parties to
| fill in their gaps. When I buy a Mac it already has drivers for
| my printers, but not if I buy a Windows PC. Some of these
| printer drivers are 250 MB, which is a crazy size for a driver.
| If it is more than a few hundred KB it means the manufacturer
| does not know how to write driver software. Microsoft should make it
| unnecessary to rely on crappy third party software so much.
| luuurker wrote:
| CrowdStrike's mess up is CrowdStrike's fault, not Microsoft's.
| We might not like the way Windows works, but it usually works
| fine and more restrictive systems also have downsides. In any
| case, it was CrowdStrike who dropped the ball and created this
| mess.
|
| I don't like what Microsoft is doing with Windows and only use
| it for gaming (I'm glad Linux is becoming a good option for
| that), so I'm far from being a "Microsoft fan", but Windows is
| very good at installing the software needed. Plug a GPU, mouse,
| etc, from any well known brand and it should work without you
| doing much.
|
| I didn't have to install anything on my Windows PC (or my MBP)
| last time I bought a new printer (Epson). The option to let
| Windows install the drivers needed is enabled though... some
| people disable that.
| breadwinner wrote:
| > _CrowdStrike's mess up is CrowdStrike's fault, not
| Microsoft's._
|
| Disagree. It is everyone's fault. It is CrowdStrike's fault
| for not testing their product. It is Microsoft's fault for
| allowing CrowdStrike to mess with the kernel and not vetting such
| critical third parties. It is the end customers' fault for
| installing crapware and not vetting the vendor.
| yoavm wrote:
| so now we're arguing for more restrictive operating
| systems? the last thing I want is an operating system that
| can only install vetted apps, and where these apps are
| restricted even if I provide my root password.
| luuurker wrote:
| We expect different things from the OS we use, I guess.
|
| My main machine is a Macbook Pro and one thing that annoys
| me a lot is the way Apple handles apps that are not
| notarized. I don't use iPhones because of the system
| restrictions (file access, background running, etc) and
| because I can only install what Apple allows on their
| store. You can see why I don't want Microsoft to hold my
| hand when I use Windows... it's my machine, I paid for it,
| I should be able to install crapware and extend the system
| functionality if that's what I want especially when I pick
| an OS that allows me to do that.
|
| In this case, enterprise customers decided to use an OS
| that allows them to also use CrowdStrike. Maybe Microsoft
| could handle this stuff better and not show a BSOD? I guess
| so, but I won't blame them for allowing these tools to
| exist.
|
| Don't get me wrong, there's a place for very restrictive
| operating systems like iOS or ChromeOS, but they're not for
| everyone or enough for all tasks. Windows is a very capable
| OS, certainly not the best option for everyone, but the day
| Microsoft cripples Windows like that, it's the day I am
| forced to stop using it.
| Alghranokk wrote:
| I think this is unfair; m$ does provide perfectly usable
| generic printer drivers, as long as you only use basic
| universal features. The problem is that the printer producers
| each want to provide a host of features on top of that, each in
| their own proprietary way, with post print hole-punching, 5
| different paper trays, user boxes, 4 different options for
| duplex printing.
|
| Also, label printers, why the heck does zebra only do EPL or
| ZPL? Why not pcl6 or PS like the rest of the universe?
|
| The point is that printers are bullshit. Nobody knows how they
| work, and assuming that microsoft should just figure it out on
| its own is, at least in my opinion, unreasonable.
| breadwinner wrote:
| What Windows was known for in the 1990s was good quality 1st
| party drivers. Then after Windows achieved monopoly status
| they shifted driver responsibility to device manufacturers. I
| have never had to install a third party driver on a Mac, but
| on Windows I do. If Apple can do it Microsoft can too.
| mardifoufs wrote:
| Which printer did you try it with? I've never had issues with
| printing out of the box with windows or mac. At least not for
| the past 5 years.
|
| Also, I'm glad Microsoft doesn't provide an easy way to get
| what is essentially complete control over a machine, and every
| single event/connection/process that it has.
| cmrdporcupine wrote:
| Is there not responsibility at some level as well on _Microsoft_
| for having a kernel which even _loaded_ this? Not just because of
| the apparent corruption, but also ... it was, I heard, signed
| and given a bit of an MS blessing.
|
| This crap shouldn't be run in kernel space. But putting that
| aside, we need kernels that will be resilient to and reject this
| stuff.
| ale42 wrote:
| The thing is that, although the file has a confusing .sys
| extension, it's not the driver, but rather a data file loaded
| by the Crowdstrike driver.
| ThinkBeat wrote:
| Maybe Crowdstrike has adopted the modern ethos: "Move fast and
| break things. With continuous integration we ship a thousand times
| a day. Fuck QA."
| motohagiography wrote:
| conspiracy prediction: I don't think CS will give a complete
| public RCA on it, but I do think the impact and crisis will be a
| pretext for granting new internet governance powers, maybe via
| EO, or a co-ordinated international response via the UN/ITU and
| the EU.
| cherryteastain wrote:
| EU recently passed a law in this domain:
| https://www.eiopa.europa.eu/digital-operational-resilience-a...
| kragen wrote:
| this seems like the second or third test file any qa person would
| have tried, after an empty file and maybe a minimal valid file.
| the level of pervasive incompetence implied here is staggering
|
| in a market where companies compete by impressing nontechnical
| upper management with presentations, it should be no surprise
| that technically competent companies have no advantage over
| incompetent ones
|
| i recently read through the craig wright decision
| https://www.judiciary.uk/judgments/copa-v-wright/ (the guy who
| fraudulently claimed to be satoshi nakamoto) and he lacked even
| the most basic technical competence in the fields where he was
| supposedly a world-class specialist (decompiling malware to c);
| he didn't know what 'unsigned' meant when questioned on the
| witness stand. he'd been doing infosec work for big companies
| going back to the 90s. he'd apparently been tricking people with
| technobabble and rigged demos and forged documents for his entire
| career
|
| george kurtz, ceo and founder of crowdstrike, was the cto of
| mcafee when they did the exact same thing 14 years ago:
| https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd...
| https://en.wikipedia.org/wiki/George_Kurtz
|
| it's horrifying that pci compliance regulations have injected
| crowdstrike (and antivirus) into virtually every aspect of
| today's it infrastructure
| GardenLetter27 wrote:
| Also ironic that the compliance ended up introducing the
| biggest vulnerability as a massive single point of failure.
|
| But that's government regulation for you.
| kragen wrote:
| pci-dss is not a government agency but it might as well be;
| it's a collusion between visa, mastercard, american express,
| discover, and jcb to prevent them from having different data
| security standards (and therefore being able to compete on
| security)
| derefr wrote:
| You mean "and therefore requiring businesses that take
| credit cards to enforce the union of all the restrictions
| imposed by all six companies (which might not even be
| possible--the restrictions might be contradictory) in order
| to accept all six types of cards"
| acdha wrote:
| > But that's government regulation for you.
|
| You misspelled "private sector". Use of endpoint monitoring
| software is coming out of private auditing companies driven
| by things like PCI or insurers' requirements - almost nobody
| wants to pay for highly-skilled security people so they're
| outsourcing it to the big auditing companies and checklists
| so that if they get sued they can say they were following
| industry practices and the audit firms okayed it.
| babypuncher wrote:
| this had nothing to do with government regulation, thank
| private sector insurance companies.
| czbond wrote:
| PCI-DSS is a method for card companies to shift the risk
| onto the merchant and away from the card companies - just
| like insurance.
| moandcompany wrote:
| It's definitely ironic, and compatible with the security
| engineering world joke that the most secure system is one
| that cannot be accessed or used at all.
|
| I suppose one way to "stop breaches" is to shut down every
| host entirely.
|
| In the military world, there is a concept of an "Alpha
| Strike" which generally relates to a fast-enough and strong-
| enough first-strike that is sufficient to disable the
| adversary's ability to respond or fight back (e.g. taking
| down an entire fleet at once). Perhaps people that have been
| burned by this event will start calling it a Crowdstrike.
| phatfish wrote:
| It seems government IT systems in general fared pretty well
| the last 12 hrs, but loads of large private companies were
| effectively taken offline, so there's that.
| worstspotgain wrote:
| I don't mean to sound conspiratorial, but it's a little early
| to rule out malfeasance just because of Hanlon's Razor.
| Most fuckups are not on a ridonkulous global scale. This
| is looking like the biggest one to date, the Y2K that wasn't.
| martin-t wrote:
| We as a society need to start punishing incompetence the same
| way we punish malice.
|
| Of course, we also need to first start punishing individuals
| for intentionally causing harm through their decisions even if
| the harm was caused indirectly through other people. Power
| allows people to distance themselves from the act. Distance
| should not affect the punishment.
| worik wrote:
| > We as a society need to start punishing incompetence the
| same way we punish malice.
|
| Yes
|
| But competence is marketed
|
| The trade names like "Crowdstrike" and "Microsoft"
| worik wrote:
| > george kurtz, ceo and founder of crowdstrike, was the cto of
| mcafee when they did the exact same thing 14 years ago:
| https://old.reddit.com/r/sysadmin/comments/1e78l0g/can_crowd...
|
| I find it amusing that the people commenting on that link are
| offended that this is called a "Microsoft" outage, when it is
| "Crowdstrike's fault".
|
| This is just as much a Microsoft failure.
|
| This is even more, another industry failure
|
| How many times does this have to happen before we get some
| industry reform that lets us do our jobs and build the secure
| reliable systems we have spent seven decades researching?
|
| 1988 all over again again again
| TeMPOraL wrote:
| It's simple: the failure is not specific to the OS.
|
| Crowdstrike runs on MacOS and Linux workstations too. And
| it's just as dangerous there; the big thread has stories of
| Crowdstrike breaking Linux systems in the past months.
|
| Crowdstrike isn't needed by/for Windows, it's mandated by
| corporate and government bureaucracies, where it serves as a
| tool of employee control and a compliance checkbox to check.
|
| That's why it makes no sense to blame Microsoft. If the world
| ran on Linux, ceteris paribus, Crowdstrike would be on those
| machines too, and would fuck them up just as bad globally.
| baxtr wrote:
| I think the worst part of the incident is that state actors now
| have a clear blueprint for a large scale infrastructure attack.
| IAmNotACellist wrote:
| I can think of a lot better things to put in a kernel-level
| driver installed on every critical computer ever than a bunch
| of 0s.
| olliej wrote:
| We can argue all we want about CI infrastructure, manual testing,
| test nets/deployment, staged deployment.
|
| All of that is secondary: they wrote and shipped code that
| blindly loaded and tried to parse content from the network, and
| crashed when that failed. In kernel mode.
|
| Honestly it's probably good that this happened, because
| presumably someone malicious could use this level of broken logic
| to compromise kernel space.
|
| Certainly the trust they put in the safety of parsing content
| downloaded from the internet makes me wonder about the
| correctness of their code for reading data from userspace.
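
The point above, that a parser fed from the network should fail safely instead of crashing, can be sketched as follows. This is only an illustration in Python, not CrowdStrike's format or code: the record layout (a little-endian 16-bit length followed by a body) and the names `parse_records` and `load_or_keep_previous` are assumptions invented for the example. A real driver would do the equivalent bounds checks in C or C++ before touching any offset.

```python
import struct


class ParseError(Exception):
    """Raised when untrusted content does not match the expected layout."""


def parse_records(blob, max_records=1024):
    """Parse hypothetical [u16 length][body] records, rejecting bad input."""
    if not blob:
        raise ParseError("empty file")
    records = []
    offset = 0
    while offset < len(blob):
        if len(records) >= max_records:
            raise ParseError("too many records")
        # Check that the length field itself is inside the buffer.
        if offset + 2 > len(blob):
            raise ParseError("truncated length field")
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        if length == 0:
            raise ParseError("zero-length record")
        # Check that the body is inside the buffer before slicing it.
        if offset + length > len(blob):
            raise ParseError("record runs past end of file")
        records.append(blob[offset:offset + length])
        offset += length
    return records


def load_or_keep_previous(blob, previous):
    """Fall back to the last known-good content if the update is malformed."""
    try:
        return parse_records(blob)
    except ParseError:
        return previous
```

With this shape, a file full of null bytes is rejected at the first record (its length field is zero) and the caller keeps running on the previous content.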
| bb88 wrote:
| We've had security software in the past break software
| compilation in this manner by replacing entire files with zeros.
| I'm not saying this is the case, but it wouldn't surprise me if
| it were.
|
| Basically the linker couldn't open the file on Windows (because
| it was locked by another process scanning it), and didn't error.
| Just replaced the object code to be linked with zeros.
|
| People couldn't figure out what was wrong until they opened a
| debugger and saw large chunks of object code replaced with zeros.
| slashdave wrote:
| I don't get it. Shouldn't the file have a standard format, with a
| required header and checksum (among other things), that the
| driver checks before executing?
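
A minimal sketch of the kind of check the question above describes, assuming a made-up file layout: a 4-byte magic value, a payload length, and a CRC32 of the payload. The magic `CHNL` and the functions `pack_content` and `load_content` are hypothetical and are not the real CrowdStrike format. Note that a CRC only detects accidental corruption; guarding against tampering would need a cryptographic signature.

```python
import struct
import zlib

MAGIC = b"CHNL"                  # hypothetical magic value for this sketch
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def pack_content(payload):
    """Build a file: header followed by the payload it describes."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def load_content(blob):
    """Return the payload only if header, length and checksum all agree."""
    if len(blob) < HEADER.size:
        raise ValueError("file too short for header")
    magic, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload
```

A file consisting only of null characters fails the magic check immediately, so the payload is never interpreted.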
| fhub wrote:
| Anytime critical infrastructure goes down I always have a
| fleeting thought back to "Spy Game" movie where the CIA cut power
| to part of a Chinese city to help with a prison escape.
| Thaxll wrote:
| I'm not versed enough in how Windows loads DLLs / drivers, but
| isn't the caller able to handle that situation? Or Windows
| itself? Can loading an empty driver file be handled in a way
| that does not make the OS crash?
| yamumsahoe wrote:
| thats a lot of prod to test in.
| jmspring wrote:
| Poor testing. But we also need to stop CISOs, etc doing "checkbox
| compliance" and installing every 3rd party thing on employee
| laptops. At my prior employer, there were literally 13 things
| installed for "laptop security" - half of them overlapped.
| Developers had the same policy as an AE and as a Sales Engineer
| as well as an HR person. Crowdstrike was one of the worst.
| Updating third party packages in Go was 30-40% faster in an
| emulated arm64 VM (qemu) - virtualized disk / disk just a large
| file - on an Intel MBP compared to doing the same operation on
| the native system in OSX.
___________________________________________________________________
(page generated 2024-07-19 23:03 UTC)