Security experts said CrowdStrike’s routine update of its widely used cybersecurity software, which caused clients’ computer systems to crash globally on Friday, apparently did not undergo adequate quality checks before it was deployed. The latest version of its Falcon Sensor software was meant to make CrowdStrike clients’ systems more secure against hacking by updating the list of threats it defends against. But faulty code in the update files resulted in one of the most widespread tech outages in recent years for companies using Microsoft’s Windows operating system.
Global banks, airlines, hospitals and government offices were disrupted. CrowdStrike released guidance on how to fix affected systems, but experts said getting them back online would take time because it required manually weeding out the flawed code. “What it looks like is, potentially, the vetting or the sandboxing they do when they look at code, maybe somehow this file was not included in that or slipped through,” said Steve Cobb, chief security officer at Security Scorecard, which also had some systems affected by the issue.
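For administrators doing that manual cleanup, the widely reported workaround amounted to booting each affected machine into Safe Mode or the Windows Recovery Environment and deleting the defective channel file before rebooting. A minimal sketch of that step in Python, assuming the standard install path and the “C-00000291*.sys” filename pattern from CrowdStrike’s published remediation guidance, might look something like this:

# Illustrative sketch only: run from Safe Mode / WinRE on a machine you
# are authorized to repair. The path and filename pattern follow the
# publicly reported CrowdStrike remediation guidance.
from pathlib import Path

def remove_faulty_channel_files(drivers_dir=r"C:\Windows\System32\drivers\CrowdStrike"):
    removed = []
    for f in Path(drivers_dir).glob("C-00000291*.sys"):
        f.unlink()                 # delete the defective channel file
        removed.append(f.name)
    return removed                 # names of the files that were removed

if __name__ == "__main__":
    print(remove_faulty_channel_files())

In practice this was typically done by hand or via a recovery-environment script, since a machine stuck in a boot loop cannot run anything until it is at least up in Safe Mode.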
Problems came to light quickly after the update was rolled out on Friday, and users posted pictures on social media of computers with blue screens displaying error messages. These are known in the industry as “blue screens of death.”
Patrick Wardle, a security researcher who specializes in studying threats against operating systems, said his analysis identified the code responsible for the outage. The update’s problem was “in a file that contains either configuration information or signatures,” he said. Such signatures are code that detects specific types of malicious code or malware. “It’s very common that security products update their signatures, like once a day… because they’re continually monitoring for new malware and because they want to make sure that their customers are protected from the latest threats,” he said. The frequency of updates “is probably the reason why (CrowdStrike) didn’t test it as much,” he said.
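Wardle’s description points at an obvious pre-release gate: sanity-check each configuration or signature file before it ships and before the sensor loads it. The checks below are invented for illustration and are not CrowdStrike’s actual pipeline; they only show the shape of such a gate.

# Toy pre-release gate for a definitions/"channel" file. The specific
# checks are invented for illustration; a real pipeline would also
# validate schema and signatures and exercise the file on test sensors.
def passes_sanity_checks(channel_file_bytes, parse):
    if not channel_file_bytes:
        return False                              # empty file
    if channel_file_bytes.count(0) == len(channel_file_bytes):
        return False                              # file is nothing but zero bytes
    try:
        parse(channel_file_bytes)                 # does the sensor's parser accept it?
    except Exception:
        return False
    return True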
It’s unclear how that faulty code got into the update and why it wasn’t detected before being released to customers. “Ideally, this would have been rolled out to a limited pool first,” said John Hammond, principal security researcher at Huntress Labs. “That is a safer approach to avoid a big mess like this.” Other security companies have had similar episodes in the past. McAfee’s buggy antivirus update in 2010 stalled hundreds of thousands of computers.
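Hammond’s “limited pool” is what deployment engineers call a canary or ring rollout: push to a small slice of hosts, watch their health, and only then widen the blast radius. A rough sketch of the idea follows; the ring sizes and health check are purely illustrative, and host.apply() is a hypothetical method, not a real CrowdStrike API.

import random

def staged_rollout(hosts, update, rings=(0.01, 0.10, 1.00), is_healthy=lambda h: True):
    """Push `update` to progressively larger rings of hosts, halting
    if any ring looks unhealthy. Purely illustrative."""
    random.shuffle(hosts)                         # note: shuffles the caller's list in place
    deployed = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        for host in hosts[deployed:target]:
            host.apply(update)                    # hypothetical per-host apply step
        deployed = target
        if not all(is_healthy(h) for h in hosts[:deployed]):
            return f"halted after {deployed} hosts: ring unhealthy"
    return f"update deployed to all {deployed} hosts"

The point is not the particular numbers but that a crash-inducing file gets caught after harming a small fraction of the fleet instead of all of it.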
But the global impact of this outage reflects CrowdStrike’s dominance. Over half of Fortune 500 companies and many government bodies, including the top U.S. cybersecurity agency itself, the Cybersecurity and Infrastructure Security Agency, use the company’s software.
Comment: Both my sons are in IT. This outage didn’t affect the systems they work on directly, but then neither of them uses the CrowdStrike Falcon Platform. My younger son often regales me with the lengths to which he and his team go to test new software and updates before they are released into the company’s systems. They maintain a fairly large test system just for this purpose. The last thing he wants to be responsible for is releasing buggy software into the wild, especially software that he and his team wrote.
I should be surprised that CrowdStrike released a faulty upgrade into the wild without rigorous testing, but I’m not. They release upgrades so often, they clearly got complacent. Diddling with software, like these upgrades, is like playing with dynamite. I wonder if any of the coders warned management that more testing is needed. If Crowdstrike survives this screw up as a company, they should make a deal with several of their customers to act as guinea pigs for future software updates. Perhaps provide free service in exchange for risking an occasional BSOD.
TTG
The gross incompetence just keeps on coming. Crony capitalism and government work get increasingly complacent due to lack of accountability. Crowdstrike is going to get sued and ruined. Good riddance.
“But the global impact of this outage reflects CrowdStrike’s …”
criminal negligence; fixed it for them.
We should all be happy the IT press is telling us Biden is in great shape, er, sorry, wrong press corps. We should all be happy the IT press is telling us there are NO malicious actors working within Crowdstrike (Russia DNC hoax anyone) or other IT companies. It’s just “oopsie,” sorry. They shut down most of the IT systems on the planet and “oops, just a boo boo, so sorry?” I don’t buy that line. Let’s make a deal for free crap in the future for the monies you lost and damages you incurred? Maybe the DOJ will be happy to do a sue and settle deal (better get it done before you know who gets elected), but what about the companies that lost millions in revenue and suffered reputational damage?
I am reminded of the scene in 12 O’Clock High where General Savage (Gregory Peck) chews the gate guard a new one for not checking his i.d. – a basic security function. This IT failure is certainly not a harmless little lapse of security. This was a failure of risk management within Crowdstrike – and a failure of risk management at a majority of companies that suffered from this single point of failure. Crowdstrike certainly deserves to see its executive leadership fired – to include the CEO. But that would take someone with a sense of integrity and responsibility. Traits in short supply lately.
Glad to see I am not the only one who appreciates, actually prefers, movies from the 1940s and 50s.
Let me recommend one:
The Ghost and Mrs. Muir.
Background for that: Britain had all too many war widows after WW2, having to bring up their children without a husband or a realistic possibility of gaining one.
I’m sure this gave them something to think about.
Moving into the 1950s, let me again recommend this hilarious scene of Cary Grant:
https://youtu.be/o8ZL_BQUi8o
Grant is amazing!
For the band that was playing, this gives some information:
https://www.exroyalmarinesbandsmen.net/Indiscreet.htm
Fred,
“This was a failure of risk management within Crowdstrike”
No.
You’ve joined the excuse makers for the inexcusable, eh?
There is software, some of it AI, that checks code for the kind of inexplicable script that was present in the Crowd Strike code.
Then there is testing on isolated servers and computers.
No one – I mean no one – just shrugs and releases the blatant F-up in the Crowd Strike code into the wild.
But if burying your head in the sand about this, like the Trump assassination attempt, helps you feel better, go ahead, by all means. Lots of people being extremely disingenuous at this point; I guess hoping they can ride out the storm and screw those who don’t.
Eric,
That term shouldn’t be that hard to understand. Risk management, as in make sure the things your company is selling don’t crash the customers’ systems and stick you with a big bill. What part of “Crowdstrike certainly deserves to see its executive leadership fired” did you miss?
Fred,
“This was a failure of risk management within Crowdstrike – and a failure of risk management at a majority of companies that suffered from this single point of failure”
Exactly. For a global update push like this, the risk management is to push in a staggered fashion. Crowdstrike has been doing too many unnecessary pushes, too often. They got burnt by their hubris.
The company is appropriately named.
Crowd Strike.
Was there a relationship between Crowdstrike and some narrative about Russians hacking the ’16 election? My work computer is still down, what a joke. The things we thought our enemies would do to us, hacking us and attacking our leaders, we seem to do to ourselves through sheer incompetence.
Of course the “theorist” might say this much incompetence seems unlikely, and perhaps they are on to something. Things are not always as they seem, but more often than not they are, and it “seems” that this is just the banality of end-game economics. People don’t care about their jobs in the same way, and it seems endemic.
It might be “fun” to imagine this outage was an elaborate cover to install some malicious software into the global IT infrastructure, but maybe I’ve seen too many Batman and Bond movies!
Things are NEVER as they seem. 😀
(Welcome to our “Brave New World”)
Why did I start humming this song a few minutes ago?
On the Bayou – Hank Williams.
https://youtu.be/xnKOVPXhlnE
Music is a good way to deflect that you are on the side of some really bad people that don’t care a hoot about you or the country; all because they messaged you that Orange Man bad! And socialism good!
On the BSOD, is there any reason it could not be replaced with the HAL eye shot from Kubrick’s film? Caption: “You’re screwed, Dave.”
Mark Logan,
I don’t see why not. It would probably just take replacing a file. I very much wanted to replace the DIA logo on our systems with a shot of my ass pressed against a piece of glass when I retired, but I would have been the obvious suspect for unauthorized tampering with government systems. Plus, I couldn’t ask any of my friends to take a picture of my ass.
That would be nice!
But most of the time it is not possible. When a system crashes, most likely you no longer have graphics capability. The text in the BSOD is useful because it shows some hints why the system crashed.
TonyL,
I’ve seen some masterpieces in ASCII art in my FIDONet days.
TTG,
Oh yes. HAL eye shot ASCII art would be excellent!
I am a retired software developer and executive.
I developed and implemented very large, complex, high-performance systems. Several things are not surprising when development is in the hands of 20- and 30-something amateurs. Never implement a new release on a Friday. You must be able to back out a release as effortlessly as possible; this often requires significant “back out” code which must be tested at least as rigorously as the application.
Live by Windows, die by Windows. The Windows OS is really not suitable for large systems. Cheap, fast, good: Pick 2.
TV
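TV’s back-out point is the standard pattern of snapshotting the current version and restoring it automatically when a post-deploy health check fails. A bare-bones illustration follows; every name in it is hypothetical rather than anyone’s real tooling.

import shutil

def deploy_with_rollback(new_build_dir, install_dir, backup_dir, health_check):
    """Install new_build_dir over install_dir, restoring the snapshot in
    backup_dir if the health check fails. Illustrative only."""
    shutil.copytree(install_dir, backup_dir)                      # snapshot the current version
    try:
        shutil.copytree(new_build_dir, install_dir, dirs_exist_ok=True)
        if not health_check():                                    # e.g. smoke tests, service pings
            raise RuntimeError("post-deploy health check failed")
    except Exception:
        shutil.rmtree(install_dir)
        shutil.copytree(backup_dir, install_dir)                  # back out to the snapshot
        raise

As TV notes, that back-out path has to be tested as rigorously as the release itself, because it only ever runs on a bad day.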
This wasn’t a new release as such but a definitions update. The point of EDR software such as CS Falcon is to get the security updates out there as quickly as possible.
Windows is not perfect but has been successfully used in large systems for decades.
CrowdStrike agents were linked to kernel panics on Red Hat Linux some weeks back as well. No software is perfect.
Sure, mistakes were made, and I’m sure many in IT will think about how to prevent such issues in the future and, if something unanticipated does happen, how to resolve the issue as quickly as possible.
Dave Plummer (who wrote the Task Manager) has a pretty good take on this issue:
https://www.youtube.com/watch?v=wAzEJxOo1ts
My brother is a corporate IT guy. Was up 12 hours getting systems back. He says that there is no way to defer Falcon updates (at least, not yet). But if your machine was off until 0130 eastern you dodged the bullet.
I don’t see any way to remotely determine if this was a testing issue, configuration control issue, or deployment issue.
Should add I was in Milwaukee trying to get out on Delta. Flight sked for AM Sat, so figured we dodged that bullet. Wrong. 2300 Fri got a text informing us the flight was cancelled. Auto-resked on AM Sun flight. No way to contact Delta via web or app. Went online to see if I could get any intel, saw a picture someone uploaded of a phone on hold for over 6 hours. Massive crowds in the airport, with reports that customer service there couldn’t help either.
So Sat afternoon got a text that the AM Sun flight was cancelled. Auto-resked for AM Mon. At that point we said forget it; rented a car and drove 12 hours Sunday. We were able to talk to Delta agents at the airport Sunday and got them to refund the tickets.
From the stories it sounds like Delta’s problem wasn’t so much the CrowdStrike issue itself, but that after they were back up their crew scheduling system couldn’t handle the volume of transactions and the OA scheduling problem of getting crews optimally to the right place. Sort of like Southwest last year, which also bit us in the rear (getting gun-shy about air travel now).