CrowdStrike IT outage affected 8.5 million Windows devices, Microsoft says

MicroWave@lemmy.world · 4 months ago

markr@lemmy.world · 4 months ago

There are a lot of misunderstandings about what happened. First, the ‘update’ was to a data file used by the CrowdStrike kernel components (specifically ‘Falcon’). While this file has a ‘.sys’ name, it is not a driver; it provides threat definition data. It is read by the Falcon driver(s), not loaded as an executable.

Microsoft doesn’t update this file. CrowdStrike’s user-mode services do that, and they do it very frequently as part of their real-time threat detection and mitigation.

The updates are essential. There is no opportunity for IT to manage or test these updates other than blocking them via external firewalls.

The Falcon kernel components apparently do not protect against a corrupted data file, or the corruption in this case evaded that protection. This is such an obvious vulnerability that I am leaning toward a deliberate manipulation of the data file to exploit a discovered vulnerability in their handling of malformed data. I have no evidence for that, other than that resilience against malformed input is very basic software engineering and CrowdStrike is a very sophisticated system.
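To make "resilience against malformed input" concrete, here is a minimal Python sketch of the idea. The file layout (`MAGIC`, the header fields, `EXPECTED_FIELDS`) is entirely hypothetical and has nothing to do with CrowdStrike's real format; the point is only that every length and count is checked before anything is indexed, and that a bad file falls back to the previous definitions instead of crashing.

```python
import struct

# Hypothetical layout, NOT CrowdStrike's real format:
# 4-byte signature, record count, fields per record (little-endian).
MAGIC = b"DEFS"
HEADER = struct.Struct("<4sHH")
EXPECTED_FIELDS = 4  # what this (hypothetical) reader was built to handle


class MalformedDefinitions(ValueError):
    """Raised instead of crashing when the data file fails validation."""


def parse_definitions(blob: bytes) -> list[tuple[int, ...]]:
    """Validate the whole file before reading any record from it."""
    if len(blob) < HEADER.size:
        raise MalformedDefinitions("file shorter than header")
    magic, count, fields = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise MalformedDefinitions("bad signature")
    if fields != EXPECTED_FIELDS:
        raise MalformedDefinitions(f"expected {EXPECTED_FIELDS} fields, got {fields}")
    record = struct.Struct(f"<{fields}I")
    if len(blob) != HEADER.size + count * record.size:
        raise MalformedDefinitions("length does not match header")
    return [
        record.unpack_from(blob, HEADER.size + i * record.size)
        for i in range(count)
    ]


def load_or_keep_previous(blob: bytes, previous: list) -> list:
    """Reject a bad update and keep running on the last known-good data."""
    try:
        return parse_definitions(blob)
    except MalformedDefinitions:
        return previous
```

A file of all zeros, a truncated file, or one with an unexpected field count all end up in the `except` branch rather than in an out-of-bounds read.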

I’m more interested in how the file got corrupted before distribution.

PlutoniumAcid@lemmy.world · 4 months ago

Yeah, how the hell did this failure pass testing, is what I want to know!

lechatron@lemmy.today · 4 months ago

That’s the neat thing: CrowdStrike bypassed the rigorous testing process required to get kernel software updates signed by Microsoft, by having the part that was tested and signed by Microsoft load another update file. It’s still unclear how CrowdStrike missed it before releasing it, though.

This is a pretty good breakdown of what happened by a retired Windows dev, including how software operates between kernel and user zones. The breakdown of what he thinks happened starts at about 6:40.

NegativeNull@lemmy.world · 4 months ago

The downstream effects are likely much, much greater. If an auth server/DB server/API server/etc. (for example) got taken down, the failure cascades.
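A rough way to picture the cascade: treat services as a dependency graph and walk outward from the one that died. This is only a sketch in Python, and the service names and the `depends_on` map are made up for illustration.

```python
from collections import deque


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or indirectly needs `failed`."""
    # Invert the map: for each service, who depends on it?
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk from the failed service through its dependents.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {failed}


# Hypothetical estate: service -> services it needs in order to work.
depends_on = {
    "api": ["auth", "db"],
    "web": ["api"],
    "reports": ["db"],
    "wiki": [],
}
print(sorted(affected_by("db", depends_on)))
```

One dead database host takes out everything that sits on top of it, even though none of those hosts had the bad file themselves.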

teejay@lemmy.world · 4 months ago

The idea that any such servers would be running Windows… shudder

thisbenzingring@lemmy.sdf.org · 4 months ago

All I know is that I had to personally fix 450 servers myself, and that doesn’t include the workstations that are probably still broken and will need to be fixed on Monday

😮‍💨

qjkxbmwvz@startrek.website · 4 months ago

Is there any automation available for this? Do you fix them sequentially or can you parallelize the process? How long did it take to fix 450?

Real clustermess, but curious what fixing it looks like for the boots on the ground.

thisbenzingring@lemmy.sdf.org · edit-2 4 months ago

Thankfully I had cached credentials and our servers aren’t bitlocker’d. The majority of the servers had iLO consoles, but not all. Most of the servers are on virtual hosts, so once I got the failover cluster back, it wasn’t that hard just working my way through them. But the hardware servers without iLO required physically plugging in a monitor and keyboard to fix, which is time-consuming. 10 of them took a couple of hours.

I worked 11+ hours straight. No breaks or lunch. That got our production domain up and the backup system back on. The dev and test domains are probably half working. My boss was responsible for those and he’s not very efficient.

So for the most part I was able to do most of the work from my admin pc in my office.

For the majority of them, I’d use the Windows recovery menu that they were stuck at to make them boot into safe mode with network support (in case my cached credentials weren’t up-to-date). Then start a cmd and type out that famous command

Del c:\windows\system32\drivers\crowdstrike\c-00000291*.sys

I’d auto complete the folders with tab and the 5 zeros … Probably gonna have that file in my memory forever

Edit: one painful self-inflicted problem was that my password is a 25-character random LastPass-generated password. IDK how I managed it, but I never typed it wrong. Yay for small wins
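For anyone wondering what scripting that step might look like, here is a hedged Python sketch. It assumes you can actually run Python on the affected box, which is far from a given in safe mode, and the function name and `dry_run` flag are my own invention. The directory and pattern are the ones from the command above; Windows file names are case-insensitive, so the lowercase pattern is kept as typed.

```python
from pathlib import Path

# Directory and pattern taken from the manual command above.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "c-00000291*.sys"


def remove_channel_files(
    directory: Path = DRIVER_DIR,
    pattern: str = PATTERN,
    dry_run: bool = True,
) -> list[Path]:
    """List the matching files; delete them only when dry_run is False."""
    matches = sorted(directory.glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Defaults to a dry run so nothing is deleted by accident.
    for path in remove_channel_files():
        print("would delete", path)
```

Running it once as a dry run and then again with `dry_run=False` mirrors checking the tab completion before hitting enter.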

John Richard@lemmy.world · 4 months ago

CrowdStrike will ultimately have contract terms that put responsibility on the companies, and truth be told the companies should be able to handle this situation with relative ease. Maybe the discussion here should be on the fragility of Windows and why Linux is a better option.

Darkassassin07@lemmy.ca · edit-2 4 months ago

Terms which should be void as this update was pushed to systems that explicitly disabled automatic updates.

Companies were literally raped by Crowdstrike.

/edit Sauce (bottom paragraph)

John Richard@lemmy.world · 4 months ago

Companies were not raped by CrowdStrike. They were raped by their own ineptitude.

Nowhere have I seen evidence that these updates were disabled and still got pushed. I’m not saying it is impossible, but it is unlikely if they followed any common sense and best practices. Usually, you’d be monitoring traffic and asking yourself why it is still checking for updates despite being disabled before deploying it to your entire IT infrastructure.

I see a lot of bad faith arguments here against CrowdStrike. I agree that they messed up, but it pales in comparison in my book to how messed up these companies are for not doing any basic planning around IT infrastructure & automation to be able to recover quickly.

Avid Amoeba@lemmy.ca · 4 months ago

Linux could have easily been bricked in a similar fashion by pushing a bad kernel or kernel module update that wasn’t tested enough. Not saying it’s the same as Windows, but this particular scenario where someone can push a system component just like that can fuck up both.

John Richard@lemmy.world · 4 months ago

Yes it can, but a kernel update is a completely different scenario, and managed individually by companies as part of their upgrades. It is usually tested and rolled out incrementally.

Furthermore, Linux doesn’t blue screen. I know some scenarios where Linux has issues, but I can count on one finger the number of times I’ve had an update cause issues booting… and that was because I was using some newer encryption settings as part of systemd.

However, it would take all my fingers & toes, and then some, to count the number of blue screens I’ve gotten with Windows… and I don’t think I’m alone in that regard.

huginn@feddit.it · 4 months ago

And you’re running corporate kernel level security software on your encrypted Linux server?

John Richard@lemmy.world · edit-2 4 months ago

I guess it depends on what you consider corporate kernel level security. Would that include AppArmor, SELinux, and other tools that are open-source but used in some of the most secure corporate and government environments? Or are you asking if I’m running proprietary untrusted code on a Linux server with access to the system kernel?

ricecake@sh.itjust.works · 4 months ago

In this case, it’s really not a Linux/windows thing except by the most tenuous reasoning.

A corrupted piece of kernel level software is going to cause issues in any OS.
CrowdStrike itself has actually caused kernel panics on Linux before, albeit less because of a corrupted driver and more because of programming choices interacting with kernel behavior. (Two bugs: you shouldn’t have done that, and it shouldn’t have let you).

Tenuously, Linux is a better choice because it doesn’t need this type of software as much. It’s easier and more efficient to do packet inspection via dedicated firewall for infrastructure, and the other parts are already handled by automation and reporting tools you already use.
You still need something in this category if you need to solve the exact problem of “realtime network and filesystem event monitoring on each host”, but Linux makes it easier to get right up to that point without diving into the kernel.
Also, vendor-managed auto-update is just less of a thing on Linux, so it’s more the cultural norm to manage updates in a way that’s conducive to the kind of staggered rollout that would have caught this.
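As a sketch of what staggering buys you (Python, with made-up host names; `deploy` and `healthy` are placeholders for whatever your own tooling does, not any real vendor API): push to a small canary ring first and stop as soon as a ring comes back unhealthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Update one ring at a time; halt when a ring exceeds the failure budget.

    Returns the hosts that were touched and whether the rollout completed.
    """
    updated: list[str] = []
    for ring in rings:
        failures = 0
        for host in ring:
            deploy(host)
            updated.append(host)
            if not healthy(host):
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            return updated, False  # later rings never receive the update
    return updated, True


# Hypothetical example: the update breaks every host it lands on.
rings = [["canary-1"], ["app-1", "app-2"], ["db-1", "db-2", "db-3"]]
touched, completed = staged_rollout(
    rings,
    deploy=lambda host: None,
    healthy=lambda host: False,
)
print(touched, completed)
```

With a bad update only the canary ring goes down, and the other five hosts never see the file.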

Contract-wise, I’m less confident that CrowdStrike has favorable terms.
It’s usually consumers who are saddled with atrocious terms, because they have neither the power nor the interest to dig into the specifics too far.
Businesses, particularly ones that need or are interested in this category of software, inevitably have lawyers to go over contract terms in much more detail, and much more ability to refuse terms and have it matter to the vendor. United Airlines isn’t going to accept the contract terms of caveat emptor.

John Richard@lemmy.world · 4 months ago

You assume that businesses operate in good faith. That they thoroughly review contracts to ensure that they are fair and in the best interests of all their employees. Do you really think Greg, a VP of Cloud Solutions who makes 500k a year, who gets his IT advice on the golf course from AWS, Microsoft, & Oracle reps, who gets wined & dined almost weekly by these reps and handed a speaking spot at re:Invent, and who believes Gartner when it says spending $5 million a month on cloud hosting and $90/TB on egress traffic is normal, has the company’s best interests in mind?

I’ve seen companies pay millions for things they never used, or that weren’t ever provided by the vendor. You go to your managers, and say… “hey, why are we paying for this?” and suddenly you’re the bad guy. I’d love for you to prove me wrong. I’ve found pieces of progress before, within isolated teams when a manager wanted to actually accomplish something. It never lasts though… it’s like being an ice cube in a glass full of warm water.

SeattleRain@lemmy.world · 4 months ago

It’s the cyber 9/11 they always worried about.