CrowdStrike Isn't the Real Problem

John Richard@lemmy.world · 4 months ago

CrowdStrike Isn't the Real Problem

kent_eh@lemmy.ca · 4 months ago

Bloated IT budgets?

Where do you work, and are they hiring?

irotsoma@lemmy.world · 4 months ago

The bloat isn’t for workers, otherwise there’d be enough people to go reboot the machines and fix the issue manually in a reasonable amount of time. It’s only for executives, managers, and contracts with kickbacks. In fact usually they buy software because it promises to cut the need for people and becomes an excuse for laying off or eliminating new hire positions.

GiveMemes@jlai.lu · 4 months ago

As the post was stating, they get bloated by relying on vendors rather than in-house IT/Security.

My grandfather works IT for my state government tho and it’s a pretty good gig according to him

breakingcups@lemmy.world · 4 months ago

Please, enlighten me how you’d remotely service a few thousand Bitlocker-locked machines, that won’t boot far enough to get an internet connection, with non-tech-savvy users behind them. Pray tell what common “basic hygiene” practices would’ve helped, especially with Crowdstrike reportedly ignoring and bypassing the rollout policies set by their customers.

Not saying the rest of your post is wrong, but this stood out as easily glossed over.

lazynooblet@lazysoci.al · 4 months ago

Autopilot, intune. Force restart device twice to get startup repair, choose factory reset, share LAPS admin password and let the workstation rebuild itself.

LrdThndr@lemmy.world · edit-2 4 months ago

A decade ago I worked for a regional chain of gyms with locations in 4 states.

I was in TN. When a system would go down in SC or NC, we originally had three options:

(The most common) have them put it in a box and ship it to me.
I go there and fix it (rare)
I walk them through fixing it over the phone (fuck my life)

I got sick of this. So I researched options and found an open source software solution called FOG. I ran a server in our office and had little optiplex 160s running a software client that I shipped to each club. Then each machine at each club was configured to PXE boot from the fog client.

The server contained images of every machine we commonly used. I could tell FOG which locations used which models, and it would keep the images cached on the client machines.

If everything was okay, it would chain the boot to the os on the machine. But I could flag a machine for reimage and at next boot, the machine would check in with the local FOG client via PXE and get a complete reimage from premade images on the fog server.
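For anyone who hasn't seen this pattern, the boot-time decision can be sketched roughly as below. This is only an illustration of the idea, not FOG's actual code: the `Host` class, the `IMAGES` table, the image name and the returned strings are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    mac: str
    model: str
    reimage_flag: bool = False  # set by the admin from the management UI


# Hypothetical mapping of hardware model to a prebuilt image name.
IMAGES = {"optiplex-160": "frontdesk-v3.img"}


def boot_decision(host: Host) -> str:
    """Decide what a PXE client should do when it checks in at boot."""
    if host.reimage_flag:
        image = IMAGES[host.model]
        # Assumption: the flag is one-shot, so the next boot is a normal one.
        host.reimage_flag = False
        return f"reimage:{image}"
    # Nothing pending: hand control to the OS already on the local disk.
    return "chainload:local-disk"
```

The point is that the decision is made before the installed OS is touched, so a broken OS cannot block it.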

The corporate office was physically connected to one of the clubs, so I trialed the software at our adjacent club, and when it worked great, I rolled it out company wide. It was a massive success.

So yes, I could completely reimage a computer from hundreds of miles away by clicking a few checkboxes on my computer. Since it ran in PXE, the condition of the os didn’t matter at all. It never loaded the os when it was flagged for reimage. It would even join the computer to the domain and set up that location’s printers and everything. All I had to tell the low-tech gymbro sales guy on the phone to do was reboot it.

This was free software. It saved us thousands in shipping fees alone. And brought our time to fix down from days to minutes.

There ARE options out there.

magikmw@lemm.ee · edit-2 4 months ago

This works great for stationary pcs and local servers, does nothing for public internet connected laptops in hands of users.

The only fix here is staggered and tested updates, and apparently this update bypassed even deferred update settings that crowdstrike themselves put into their software.

The only winning move here was to not use crowdstrike.

LrdThndr@lemmy.world · 4 months ago

Absolutely. 100%

But don’t let perfect be the enemy of good. A fix that gets you 40% of the way there is still 40% less work you have to do by hand. Not everything has to be a fix for all situations. There’s no such thing as a panacea.

magikmw@lemm.ee · 4 months ago

Sure. At the same time one needs to manage resources.

I was all in on laptop deployment automation. It cut down on a lot of human error issues and having inconsistent configuration popping up all the time.

But it needs constant supervision, even if not constant updates. More systems and solutions lead to neglect if not supplied well. So some “would be good to have” systems just never make the cut, because as overachieving as I am, I also don’t want to think everything is taken care of when it clearly isn’t.

John Richard@lemmy.world · 4 months ago

You were all in, but was the company all in? How many employees? It sounds like you innovated. Let’s say that the company you worked for was spending millions on vendors that promised solutions but rarely delivered. If instead they gave you $400k a year, a $1 million/year budget & 10 employees… I’m guessing you could have managed the laptop deployment automation, along with some other significant projects as well.

Instead though, people with good ideas, even loyal to the company, are competing against sales and marketing reps from billion dollar companies, and upper management is easily swayed.

wizardbeard@lemmy.dbzer0.com · 4 months ago

It also assumes that reimaging is always an option.

Yes, every company should have networked storage enforced specifically for issues like this, so no user data would be lost, but there’s often a gap between should and “has been able to find the time and get the required business side buy in to make it happen”.

Also, users constantly find new ways to do non-standard, non-supported things with business critical data.

Bluetreefrog@lemmy.world · 4 months ago

Isn’t this just more of what caused the problem in the first place? Namely, centralisation. If you store data locally and you lose a machine, that’s bad but not the end of the world. If you store it centrally and you lose the data, that’s catastrophic. Nassim Taleb nailed this stuff. Keep the downside limited, and the upside unlimited or as he says, “Don’t pick up pennies in front of a steamroller.”

John Richard@lemmy.world · 4 months ago

Almost all computers can be set to PXE boot, but work laptops usually have even more advanced remote management capabilities. You ask the employee to reboot the laptop and presto!

magikmw@lemm.ee · 4 months ago

I wonder how you’re supposed to get PXE boot to work securely over the internet. And how that helps when the affected disk is still encrypted and needs unusual intervention to fix, including admin access to system files.

I’ve been doing this for a while, and I like creative solutions, so I wonder about those issues a lot. Not much comes to my mind besides let’s recall all the laptops and do it one by one.

John Richard@lemmy.world · 4 months ago

I wonder how you’re supposed to get PXE boot to work securely over the internet.

PXE boot is more of a last resort IMO, but can be used as a chainloader to a more secure option. The biggest challenge I could see security-wise is having PXE boot being run on unsecured networks. Even then though, normally a computer will have been provisioned on a secure network and will have encryption and secure boot-based encryption, and some additional signature-based image verification.

Evotech@lemmy.world · edit-2 4 months ago

Now your fog servers are dead. What now

yeehaw@lemmy.ca · 4 months ago

This is a good solution for these types of scenarios. Doesn’t fit all though. Where I work, 85% of staff work from home. We largely use SaaS. I’m struggling to think of a good method here other than walking them through reinstalling windows on all their machines.

John Richard@lemmy.world · edit-2 4 months ago

Configure PXE to reboot into recovery image, push out command to remove bad file. Reboot. Done. Workstation laptops usually have remote management already.

or

Have recovery image already installed. Have user reboot & push key to boot into recovery. Push out fix. Done.
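To show how small the actual fix was, here is a rough sketch of the "remove bad file" step as a script a recovery image could run. The `C-00000291*.sys` pattern comes from CrowdStrike's published workaround; the function name is mine, and the script assumes the Windows volume is already unlocked and mounted, which is the hard part on BitLocker machines.

```python
from pathlib import Path

# File name pattern from CrowdStrike's published workaround for this incident.
BAD_PATTERN = "C-00000291*.sys"


def remove_bad_channel_files(driver_dir: Path) -> list[str]:
    """Delete matching channel files and return the names that were removed."""
    removed = []
    for path in sorted(driver_dir.glob(BAD_PATTERN)):
        path.unlink()
        removed.append(path.name)
    return removed
```

On an affected machine `driver_dir` would be the `Windows\System32\drivers\CrowdStrike` folder on whatever drive letter or mount point the recovery environment gives the system volume.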

Brkdncr@lemmy.world · 4 months ago

How removed from IT are you that you think fog would have helped here?

LrdThndr@lemmy.world · edit-2 4 months ago

How would it not have? You got an office or field offices?

“Bring your computer by and plug it in over there.” And flag it for reimage. Yeah. It’s gonna be slow, since you have 200 of the damn things running at once, but you really want to go and manually touch every computer in your org?

The damn thing’s even boot looping, so you don’t even have to reboot it.

I’m sure the user saved all their data in OneDrive like they were supposed to, right?

I get it, it’s not a 100% fix rate. And it’s a bit of a callous answer to their data. And I don’t even know if the project is still being maintained.

But the post I replied to was lamenting the lack of an option to remotely fix unbootable machines. This was an option to remotely fix nonbootable machines. No need to be a jerk about it.

But to actually answer your question and be transparent, I’ve been doing Linux devops for 10 years now. I haven’t touched a windows server since the days of the gymbros. I DID say it’s been a decade.

Brkdncr@lemmy.world · 4 months ago

Because your imaging environment would also be down. And you’re still touching each machine and bringing users into the office.

Or your imaging process over the wan takes 3 hours since it’s dynamically installing apps and updates and not a static “gold” image. Imaging is then even slower because your source disk is only ssd and imaging slows down once you get 10+ going at once.

I’m being rude because I see a lot of armchair sysadmins that don’t seem to understand the scale of the crowdstrike outage, what crowdstrike even is beyond antivirus, and the workflow needed to recover from it.

John Richard@lemmy.world · 4 months ago

Imaging environment down? If a sysadmin can’t figure out how to boot a machine into recovery to remove the bad update file then they have bigger problems. The fix in this instance wasn’t even re-imaging machines. It was merely removing a file. Ideal DR scenario would have a recovery image already on the system that can be booted into remotely, so there is minimal strain on the network. Furthermore, we don’t live in dial-up age anymore.

John Richard@lemmy.world · 4 months ago

Thank you for sharing this. This is what I’m talking about. Larger companies not utilizing something like this already are dysfunctional. There are no excuses for why it would take them days, weeks or longer.

ramble81@lemm.ee · 4 months ago

You’d have to have something even lower level like an OOB KVM on every workstation which would be stupid expensive for the ROI, or something at the UEFI layer that could potentially introduce more security holes.

circuscritic@lemmy.ca · edit-2 4 months ago

…you don’t have OOBM on every single networked device and terminal? Have you never heard of the buddy system?

You should probably start writing up an RFP. I’d suggest you also consider doubling up on the company issued phones per user.

If they already have an ATT phone, get them a Verizon one as well, or vice versa.

At my company we’re already way past that. We’re actually starting to import workers to provide human OOBM.

You don’t answer my call? I’ll just text the migrant worker we chained to your leg to flick your ear until you pick up.

Maybe that sounds extreme, but guess whose company wasn’t impacted by the Crowdstrike outage.

Leeks@lemmy.world · 4 months ago

Maybe they should offer a real time patcher for the security vulnerabilities in the OOB KVM, I know a great vulnerability database offered by a company that does this for a lot of systems world wide! /s

John Richard@lemmy.world · 4 months ago

UEFI isn’t going away. Sorry to break the news to you.

ramble81@lemm.ee · 4 months ago

I didn’t say it was, nor did I say UEFI was the problem. My point was additional applications or extensions at the UEFI layer increase the attack footprint of a system. Just like vPro, you’re giving hackers a method that can compromise a system below the OS. And add that in to laptops and computers that get plugged in random places before VPNs and other security software is loaded and you have a nice recipe for hidden spyware and such.

mynamesnotrick@lemmy.zip · edit-2 4 months ago

Was a windows sysadmin for a decade. We had thousands of machines with endpoint management with bitlocker encryption. (I have since moved on to more cloud Kubernetes devops.) Anything on a remote endpoint doesn’t have any basic “hygiene” solution that could remotely fix this mess automatically. I guess Intel’s BIOS remote connection (forget the name) could in theory allow at least some poor tech to remote in, given there is an internet connection and the company paid the exorbitant price.

All that to say, anything on end-user machines that stops them from booting is a nightmare. And since BitLocker it’s even more complicated. (Hope your BitLocker key synced… Lol).

Spuddlesv2@lemmy.ca · 4 months ago

You’re thinking of Intel vPro. I imagine some of the Crowdstrike ~~victims~~ customers have this and a bunch of poor level 1 techs are slowly grinding their way through every workstation on their networks. But yeah, OP is deluded and/or very inexperienced if they think this could have been mitigated on workstations through some magical “hygiene”.

LrdThndr@lemmy.world · 4 months ago

Bro. PXE boot image servers. You can remotely image machines from hundreds of miles away with a few clicks and all it takes on the other end is a reboot.

wizardbeard@lemmy.dbzer0.com · 4 months ago

With a few clicks and being connected to the company network. Leaving anyone not able to reach an office location SOL.

LrdThndr@lemmy.world · 4 months ago

Hey, it’s not perfect, but a fix that gets you 10% of the way there is still 10% you don’t have to do by hand. Don’t let perfect be the enemy of good, my man.

Riskable@programming.dev · edit-2 4 months ago

what common “basic hygiene” practices would’ve helped

Not using a proprietary, unvetted, auto-updating, 3rd party kernel module in essential systems would be a good start.

Back in the day companies used to insist upon access to the source code for such things along with regular 3rd party code audits but these days companies are cheap and lazy and don’t care as much. They’d rather just invest in “security incident insurance” and hope for the best 🤷

Sometimes they don’t even go that far and instead just insist upon useless indemnification clauses in software licenses. …and yes, they’re useless:

https://www.nolo.com/legal-encyclopedia/indemnification-provisions-contracts.html#:~:text=Courts have commonly held that,knowledge of the relevant circumstances).

(Important part indicating why they’re useless should be highlighted)

JasonDJ@lemmy.zip · 4 months ago

Does Windows have a solid native way to remotely re-image a system like macOS does?

catloaf@lemm.ee · 4 months ago

No.

Maybe with Intune and Autopilot, but I haven’t used it.

lazynooblet@lazysoci.al · 4 months ago

If you don’t know, don’t answer

John Richard@lemmy.world · 4 months ago

Windows ADK does this too, or any PXE server really… so yes, you can. The CS issue though didn’t require re-image. Merely removing a file. DR planning would usually have a recovery image pre-installed to automate booting into for lower-level fixes.

Dran@lemmy.world · edit-2 4 months ago

Separate persistent data and operating system partitions, ensure that every local network has small pxe servers, vpned (wireguard, etc) to a cdn with your base OS deployment images, that validate images based on CA and checksum before delivering, and give every user the ability to pxe boot and redeploy the non-data partition.

Bitlocker keys for the OS partition are irrelevant because nothing of value is stored on the OS partition, and keys for the data partition can be stored and passed via AD after the redeploy. If someone somehow deploys an image that isn’t ours, it won’t have keys to the data partition because it won’t have a trust relationship with AD.

(This is actually what I do at work)
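The checksum half of that validation could look something like the sketch below. It is not the setup described above, just a minimal illustration: the function names are made up, the expected digest is assumed to come from a manifest you already trust, and the CA/signature check is left out because it needs a crypto library.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_trusted(path: Path, expected_sha256: str) -> bool:
    """Return True only if the image matches the digest from the trusted manifest."""
    return hmac.compare_digest(sha256_of(path), expected_sha256.lower())
```

A PXE server would call `image_is_trusted` before serving an image and refuse to deliver anything that fails.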

I_Miss_Daniel@lemmy.world · 4 months ago

Sounds good, but can you trust an OS partition not to store things in %programdata% etc that should be encrypted?

Dran@lemmy.world · 4 months ago

With enough autism in your overlay configs, sure, but in my environment that leakage is still encrypted. It’s far simpler to just accept leakage and encrypt the OS partition with a key that’s never stored anywhere. If it gets lost, you rebuild the system from pxe. (Which is fine, because it only takes about 20 minutes and no data we care about exists there) If it’s working correctly, the OS partition is still encrypted and protects any inadvertent data leakage from offline attacks.

AnAmericanPotato@programming.dev · 4 months ago

This doesn’t seem to be a problem with disaster recovery plans. It is perfectly reasonable for disaster recovery to take several hours, or even days. As far as DR goes, this was easy. It did not generally require rebuilding systems from backups.

In a sane world, no single party would even have the technical capability of causing a global disaster like this. But executives have been tripping over themselves for the past decade to outsource all their shit to centralized third parties so they can lay off expensive IT staff. They have no control over their infrastructure, their data, or, by extension, their business.

r00ty@kbin.life · 4 months ago

I think it’s most likely a little of both. The fact that most systems failed at around the same time suggests that this was the default automatic upgrade/deployment option.

So, for sure the default option should have had upgrades staggered within an organisation. But at the same time organisations should have been ensuring they aren’t upgrading everything at once.
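A staggered rollout does not need to be fancy. Below is a rough sketch of the idea, with everything in it (function names, the 5% threshold, hashing hostnames into waves) being my own assumptions rather than how any vendor does it: hosts are split into waves, and the rollout stops as soon as one wave fails too often.

```python
import hashlib


def wave_for(host: str, waves: int) -> int:
    """Deterministically map a hostname to a wave number in [0, waves)."""
    digest = hashlib.sha256(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % waves


def staged_rollout(hosts, waves, apply_update, max_failure_rate=0.05):
    """Update wave by wave and stop once a wave exceeds the failure limit.

    apply_update(host) is a caller-supplied function returning True on success.
    Returns the hosts that were updated successfully.
    """
    updated = []
    for wave in range(waves):
        batch = [h for h in hosts if wave_for(h, waves) == wave]
        if not batch:
            continue
        failures = 0
        for host in batch:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        if failures / len(batch) > max_failure_rate:
            break  # later waves never receive the bad update
    return updated
```

With a bad update, only the first wave is hit and the rest of the organisation keeps running.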

As it is, the way the upgrade was deployed made the software a single point of failure that completely negated redundancies and in many cases hobbled disaster recovery plans.

DesertCreosote@lemm.ee · 4 months ago

Speaking as someone who manages CrowdStrike in my company, we do stagger updates and turn off all the automatic things we can.

This channel file update wasn’t something we can turn off or control. It’s handled by CrowdStrike themselves, and we confirmed that in discussions with our TAM and account manager at CrowdStrike while we were working on remediation.

daddy32@lemmy.world · 4 months ago

There was a “hack” mentioned in another thread - you can block it via firewall and then selectively open it.

r00ty@kbin.life · 4 months ago

That’s interesting. We use crowdstrike, but I’m not in IT so don’t know about the configuration. Is a channel file somehow similar to AV definitions? That would make sense, and I guess means this was a bug in the crowdstrike code in parsing the file somehow?

DesertCreosote@lemm.ee · 4 months ago

Yes, CrowdStrike says they don’t need to do conventional AV definitions updates, but the channel file updates sure seem similar to me.

The file they pushed out consisted of all zeroes, which somehow corrupted their agent and caused the BSOD. I wasn’t on the meeting where they explained how this happened to my company; I was one of the people woken up to deal with the initial issue, and they explained this later to the rest of my team and our leadership while I was catching up on missed sleep.

I would have expected their agent to ignore invalid updates, which would have prevented this whole thing, but this isn’t the first time I’ve seen examples of bad QA and/or their engineering making assumptions about how things will work. For the amount of money they charge, their product is frustratingly incomplete. And asking them to fix things results in them asking you to submit your request to their Ideas Portal, so the entire world can vote on whether it’s a good idea, and if enough people vote for it they will “consider” doing it. My company spends a fortune on their tool every year, and we haven’t been able to even get them to allow non-case-sensitive searching, or searching for a list of hosts instead of individuals.

r00ty@kbin.life · 4 months ago

Thanks. That explains a lot of what I didn’t think was right regarding the almost simultaneous failures.

I don’t write kernel code at all for a living. But, I do understand the rationale behind it, and it seems to me this doesn’t fit that expectation. Now, it’s a lot of hypothetical. But if I were writing this software, any processing of these files would happen in userspace. That way bad or badly formatted data would be rejected there, and even if a file managed to crash the code processing it, the result would just be an app crash.

The general rule I’ve always heard is that you want to keep the minimum required work in the kernel code. So I think processing/rejection should have been happening in userspace (and perhaps even using code written in a higher level language with better memory protections etc) and then a parsed and validated set of data would be passed to the kernel code for actioning.
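To make that concrete, here is a toy sketch of "validate in userspace, then hand over". The file format (a `CHNL` magic plus a length field), the exception class and the `hand_to_kernel` callback are all invented for illustration; I have no idea what the real format looks like.

```python
import struct

MAGIC = b"CHNL"                 # hypothetical magic bytes, not the real format
HEADER = struct.Struct("<4sI")  # magic + little-endian payload length


class InvalidContentFile(ValueError):
    """Raised when an update file fails validation."""


def parse_content_file(blob: bytes) -> bytes:
    """Validate an update file in userspace and return its payload."""
    if len(blob) < HEADER.size:
        raise InvalidContentFile("file shorter than header")
    if not any(blob):
        raise InvalidContentFile("file is all zero bytes")
    magic, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise InvalidContentFile("bad magic")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise InvalidContentFile("length mismatch")
    return payload


def load_update(blob: bytes, hand_to_kernel) -> bool:
    """Pass only validated payloads on; otherwise keep the last good content."""
    try:
        payload = parse_content_file(blob)
    except InvalidContentFile:
        return False
    hand_to_kernel(payload)
    return True
```

A bad file then costs you one rejected update instead of a boot loop.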

But, I admit I’m observing from the outside, and it could be nothing like this. But, on the face of it, it does seem to me like they were processing too much in the kernel code.

technocrit@lemmy.dbzer0.com · edit-2 4 months ago

An underlying problem is that legal security is mostly security theatre. Legal security provides legal cover for entities without much actual security.

The point of legal security is not to protect privacy, users, etc., but to protect the liability of legal entities when the inevitable happens.

neglecting the due diligence necessary to ensure those solutions truly fit their needs.

CrowdStrike perfectly met their needs by providing someone else to blame. I don’t think anybody is facing any consequences for contracting with CrowdStrike. It’s the same deal with Microsoft X 10000000. These bad incentives are the whole point of the system.

Leeks@lemmy.world · 4 months ago

bloated IT budgets

Can you point me to one of these companies?

In general IT is run as a “cost center” which means they have to scratch and save everywhere they can. Every IT department I have seen is under staffed and spread too thin. Also, since it is viewed as a cost, getting all teams to sit down and make DR plans (since these involve the entire company, not just IT) is near impossible since “we may spend a lot of time and money on a plan we never need”.

John Richard@lemmy.world · 4 months ago

With most corporations, especially Fortune 500s… audit their budgets. The problem doesn’t start with IT, but with bad management from top down. This “cost center” you speak of is mostly what I’d expect to hear do-nothing middle-level managers tell their in-house employees when asking for a raise.

Leeks@lemmy.world · 4 months ago

It feels like you have an agenda that you are trying to apply to the CrowdStrike event and just so happen to be slandering IT as an innocent bystander to the agenda you are putting forward.

If you had to summarize the goal of your initial post in fewer than 10 words, what would it be?

John Richard@lemmy.world · 4 months ago

Worked many high-level corp IT. Problem is them, not CrowdStrike.

Rhaedas@fedia.io · 4 months ago

I don’t think it’s that uncommon an opinion. An even simpler version is the constant repeats over years now of information breaches, often because of inferior protection. As an amateur website creator decades ago I learned that plain text passwords were a big no-no, so how are corporate IT departments still doing it? Even the non-tech person on the street rolls their eyes at such news, and yet it continues. CrowdStrike is just a more complicated version of the same thing.
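For reference, not storing plain text takes only a few lines with the standard library. This is a minimal sketch, not a full auth system: the function names are mine and the iteration count is an arbitrary illustrative value you would tune for your own hardware.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative value; higher is slower to brute-force


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both instead of the password itself."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

A breach of the database then leaks salts and digests, not passwords people reuse elsewhere.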

TechNerdWizard42@lemmy.world · 4 months ago

Issue is definitely corporate greed outsourcing issues to a mega monolith IT company.

Most IT departments are idiots now. Even 15 years ago, those were the smartest nerds in most buildings. They had to know how to do it all. Now it’s just installing the corporate overlord software and the bullshit spyware. When something goes wrong, you call the vendor’s support line. That’s not IT, you’ve just outsourced all your brains to a monolith that can go at any time.

None of my servers running windows went down. None of my infrastructure. None of the infrastructure I manage as side hustles.

Lettuce eat lettuce@lemmy.ml · edit-2 4 months ago

I’ve seen the same thing. IT departments are less and less interested in building and maintaining in-house solutions.

I get why, it requires more time, effort, money, and experienced staff to pay.

But you gain more robust systems when it’s done well. Companies want to cut costs everywhere they can, and it’s cheaper to just pay an outside company to do XY&Z for you and just hire an MSP to manage your web portals for it, or maybe 2-3 internal sysadmins that are expected to do all that plus level 1 help desk support.

Same thing has happened with end users. We spent so much time trying to make computers “friendly” to people, that we actually just made people computer illiterate.

I find myself in a strange place where I am having to help Boomers, older Gen-X, and Gen-Z with incredibly basic computer functions.

Things like:

Changing their passwords when the policy requires it.
Showing people where the Start menu is and how to search for programs there.
How to pin a shortcut to their task bar.
How to snap windows to half the screen.
How to un-mute their volume.
How to change their audio device in Teams or Zoom from their speakers to their headphones.
How to log out of their account and log back in.
How to move files between folders.
How to download attachments from emails.
How to attach files in an email.
How to create and organize Browser shortcuts.
How to open a hyperlink in a document.
How to play an audio or video file in an email.
How to expand a basic folder structure in a file tree.
How to press buttons on their desk phone to hear voicemails.

It’s like only older Millennials and younger gen-X seem to have a general understanding of basic computer usage.

Much of this stuff has been the same for literally 30+ years. The Start menu, folders, voicemail, email, hyperlinks, browser bookmarks, etc. The coat of paint changes every 5-7 years, but almost all the principles are identical.

Can you imagine people not knowing how to put a car in drive, turn on the windshield wipers, or fill it with petrol, just because every 5-7 years the body style changes a little?

ocassionallyaduck@lemmy.world · 4 months ago

Man, as someone who’s been cross-discipline in my former companies, the way people treat IT, and the way the company considers IT as an afterthought, is just insane. The technical debt is piled high.

istanbullu@lemmy.ml · 4 months ago

The real problem is the monopolization of IT and the Cloud.

viking@infosec.pub · 4 months ago

Is there a way to remotely boot into network activated recovery mode? Genuine question, I never looked into it.

lud@lemm.ee · 4 months ago

For physical servers there are out of band management systems like Dell DRAC that allow you to manage the server even when the OS is broken or non existent.

For clients there are systems like Intel vPro (with AMT) and AMD’s DASH-based management. I have not used either of them but they apparently work similarly to the systems used on servers.

viking@infosec.pub · 4 months ago

Ah neat, I’ll look those up. Thanks a lot!

NarrativeBear@lemmy.world · 4 months ago

An expensive KVM card, or PiKVM for the home server.

daddy32@lemmy.world · 4 months ago

At least for virtual servers, there has to be a cheaper software equivalent, as my cheap VPS allows this (via vnc) with no issues.

computergeek125@lemmy.world · edit-2 4 months ago

Virtual servers (as opposed to hardware workstations or servers) will usually have their “KVM” (Keyboard Video Mouse) built in to the hypervisor control plane. ESXi, Proxmox (KVM - Kernel Virtual Machine), XCP-ng/Citrix XenServer (Xen), Nutanix (KVM-like), and many others all provide access to this. It all comes down to what’s configured on the hypervisor OS.

VMs are easy because the video and control feeds are software constructs so you can just hook into what’s already there. Hardware (especially workstations) is harder because you don’t always have a chip on the motherboard that can tap that data. Servers usually have a dedicated co-computer soldered onto the motherboard to do this, but if there’s nothing nailed down to do it, your remote access is limited to what you can plug in. PiKVM is one such plug-in option.

daddy32@lemmy.world · 4 months ago

Thank you for the explanation, I really appreciate it. Bystanders probably will too :)

edric@lemm.ee · edit-2 4 months ago

For sure there is a problem, but this issue caused computers to not be able to boot in the first place, so how are you gonna remotely reboot them if you can’t connect to them? Sure there can be a way like one other comment explained, but it’s so complicated and expensive that not even all of the biggest corporations do it.

Contrary to what a lot of people seem to think, CrowdStrike is pretty effective at what it does, that’s why they are big in the corporate IT world. I’ve worked with companies where the security team had a minority influence on choosing vendors, with the finance team being the major decision maker. So cheapest vendor wins, and CrowdStrike is not exactly cheap. If you ask most IT people, their experience is the opposite of bloated budgets. A lot of IT teams are understaffed and do not have the necessary tools to do their work. Teams have to beg every budget season.

The failure here is hygiene yes, but in development testing processes. Something that wasn’t thoroughly tested got pushed into production and released. And that applies to both Crowdstrike and their customers. That is not uncommon (hence the programmer memes), it just happened to be one of the most prevalent endpoint security solutions in the world that needed kernel level access to do its job. I agree with you in that IT departments should be testing software updates before they deploy, so it’s also on them to make sure they at least ran it in a staging environment first. But again, this is a tool that is time critical (anti-malware) and companies need to have the capability to deploy updates fast. So you have to weigh speed vs reliability.

John Richard@lemmy.world · 4 months ago

Booting a system or recovery image remotely over an IPMI or similar interface is not complicated or expensive. It is one of the most basic server management tasks. You acting like the concept is challenging seriously concerns me and I seriously wonder how anyone that thinks like that gets hired.

There are exceptions, granted. However, the IT budget at most mid to large-size corporations is extremely bloated. I don’t think you can in good faith argue otherwise, unless you want to show me a budget that isn’t. Do you have a real one that you can provide?

These companies don’t even attract smart talent. They attract people that are content with doing nothing & collecting a paycheck. Smart people do not continue to work at these companies. The bureaucracy and management are soul-sucking. It took me a while to accept it too. I used to be optimistic thinking there is a logical explanation that can be fixed. Turns out they don’t want to be fixed. They like to be broken. Like I said, it starts from the top down. A lot of the staff wouldn’t even have a job if people actually tried to make things better.

edric@lemm.ee · edit-2 4 months ago

It is one of the most basic server management tasks.

Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.

You acting like the concept is challenging seriously concerns me and I seriously wonder how anyone that thinks like that gets hired.

Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.

unless you want to show me a budget that isn’t. Do you have a real one that you can provide?

Can YOU show me the bloated budgets and where they are allocated on those mid to large size corporations? You are the one who insinuated that. All I said is that my experience at all the companies I worked with is that we always had to fight hard for budget, because the sales and marketing departments bring in the $$$ and that’s the only thing the executives like to see, therefore they get the budget. If your entire working experience is that your IT team had too much budget, then consider yourself privileged.

It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.

John Richard@lemmy.world · 4 months ago

Except these were endpoint machines, not servers. Things ground to a halt not because servers went down, but because the computers end users interacted with crashed and wouldn’t boot, kiosk and POS systems included.

Endpoint machines still have IPMI-type interfaces and PXE. When you manage thousands of machines, if you treat them all like pets then you’re doing it wrong.

Damn, I guess all the IT people running the systems that were affected aren’t fit for the job.

Is it going to take them several days to weeks to recover? Then they aren’t fit for the job, or should consider another profession.

Can you show me the bloated budgets and where they are allocated on those mid to large size corporations?

All of them. The Form 10-K filings are available for public corporations. The ones claiming that they will be impacted for a while are the ones I’m concerned most about.

It’s weird how you’re all defensive and devolve to insults when people are just responding to your post.

I spent a career arguing with sales reps who had one goal in mind, and that was to make the biggest commission possible. I sound argumentative because those sales reps had every tool imaginable to show up out of nowhere.

SparrowRanjitScaur@lemmy.world · 4 months ago

C++ is the problem. C++ is an unsafe language that should definitely not be used for kernel space code in 2024.

John Richard@lemmy.world · 4 months ago

Let’s rewrite everything in Rust. That’ll surely solve the world’s problems.

FaceDeer@fedia.io · 4 months ago

particularly for companies entrusted with vast amounts of sensitive personal information.

I nodded along to most of your comment but this cast a discordant and jarring tone over it. Why particularly those companies? The CrowdStrike failure didn’t actually result in sensitive information being deleted or revealed; it just caused computers to shut down entirely. Throwing that in there as an area of particular concern seems clickbaity.

John Richard@lemmy.world · 4 months ago

It was to point out that there is a bigger issue here: corporate IT culture is broken. The CrowdStrike incident merely exposes it, but CrowdStrike isn’t the real problem. Remediation for an event like this, especially once the fix is known, should be 30 minutes… not weeks or months.

RaoulDook@lemmy.world · 4 months ago

The OS should be mature enough by now that it could automatically recover from crashing on the load of a bad 3rd party driver. But it was not, wtf.

John Richard@lemmy.world · 4 months ago

Microsoft has been too busy building a new Outlook PWA with ads in your email, and AI laptops that capture screenshots of your desktop in unencrypted folders.

catloaf@lemm.ee · 4 months ago

It can, sort of. Safe mode will still boot just fine. But then what should it do? Just blacklist the driver and reboot? That’s not going to work too well if it’s the storage driver.

RaoulDook@lemmy.world · 4 months ago

Well, they could still just blacklist all 3rd party drivers except storage drivers. Many categories of 3rd party drivers could be excluded fully during a selective recovery boot process.
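As a rough sketch of that idea in Python: the filter below keeps every first-party driver plus third-party drivers in categories the machine cannot boot without. This is a toy model only, not how Windows actually selects drivers; the `Driver` type, the category labels, and the driver names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    name: str
    third_party: bool
    category: str  # hypothetical label, e.g. "storage", "security", "display"


# Assumption: storage is the only category a recovery boot cannot skip.
ESSENTIAL_CATEGORIES = frozenset({"storage"})


def recovery_load_list(drivers, essential=ESSENTIAL_CATEGORIES):
    """Return the drivers a selective recovery boot would still load.

    First-party drivers are always kept; third-party drivers are kept
    only if their category is considered essential for booting.
    """
    return [d for d in drivers if not d.third_party or d.category in essential]


if __name__ == "__main__":
    installed = [
        Driver("os_disk.sys", third_party=False, category="storage"),
        Driver("vendor_raid.sys", third_party=True, category="storage"),
        Driver("vendor_edr.sys", third_party=True, category="security"),
        Driver("vendor_gpu.sys", third_party=True, category="display"),
    ]
    for driver in recovery_load_list(installed):
        print(driver.name)
```

In this example the hypothetical third-party security and display drivers are skipped while both storage drivers still load. It also shows the limit catloaf raised: if the faulty driver is itself a third-party storage driver, this filter would not help.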