This Week Health


July 24, 2024: Russell Teague, CISO at Fortified Health Security, joins Drex to dissect the recent global impact of a CrowdStrike update gone wrong. How do businesses effectively prepare for unexpected system failures? What lessons can be gleaned from the rapid response and recovery efforts witnessed during this crisis? The discussion delves into the intricacies of business continuity, disaster recovery planning, and the interconnected nature of modern cybersecurity. They explore the real-world implications for healthcare systems and the broader impacts on industries worldwide. How do organizations prioritize recovery efforts in the face of widespread outages? How can they leverage these experiences to bolster future resilience?

Key Points:

  • 01:03 CrowdStrike Update Incident Overview
  • 03:18 Impact on Healthcare and Other Sectors
  • 05:11 Recovery Efforts and Challenges
  • 09:05 Lessons Learned and Future Preparedness
  • 20:50 Conclusion and Final Thoughts


Transcript

This transcription is provided by artificial intelligence. We believe in technology but understand that even the smartest robots can sometimes get speech recognition wrong.

Thanks to our partner Fortified Health Security. No matter where you're at in your cybersecurity journey, Fortified can help you improve your cybersecurity posture through their 24/7 threat defense services or advisory services delivered through Central Command, a first-of-its-kind platform that simplifies cybersecurity management and provides the visibility you need to mature your program.

Learn more at fortifiedhealthsecurity.com. Today on Unhack the News.

(Intro)  Most people will hopefully document this as a real test of their business continuity and disaster recovery plans. Turn a negative into a positive by using it as a valid test and a valid instance, and then let's improve upon them and be better prepared because preparedness is the only thing that we have on our side to be ready for the next one.

Hi, I'm Drex DeFord, a recovering healthcare CIO and long-time cyber advisor and strategist for some of the world's most innovative cybersecurity companies. Now I'm president of This Week Health's 229 Cyber and Risk Community, and this is Unhack the News, a mostly plain-English, mostly non-technical show covering the latest and most important security news stories. And now, this episode of Unhack the News.

(Main)  hey everyone. Welcome to Unhack the News. I'm with Russell Teague at Fortified, the CISO over there, who I think probably had a really crazy weekend this weekend. How are you doing?

Yeah, absolutely. Doing good, thank you, Drex. And yeah, it's been a very interesting weekend. It started on Friday, obviously, as you well know, following the release of the CrowdStrike content update.

That came in on Friday morning.

And it's basically completely owned the news cycle from Friday on. There are tons of stories to talk about, and we can call out some of them individually if you'd like, but you're out there, you're in the field all the time.

What happened?

Yeah, so when you boil it down, CrowdStrike pushed an update to the Windows agent, part of the CrowdStrike Falcon platform, which is a leading cybersecurity endpoint detection tool widely used globally across all industries, and they push updates regularly, right?

Think about it as a signature update or a simple software update that's being pushed out to the endpoint agent. Mostly benign, right? It really shouldn't impact operations, and it usually gets pushed out late at night. But for whatever reason, this push created a conflict with the Windows operating system.

Predominantly, this was a push to the Windows agent, so it only impacted Windows operating systems, but it created a conflict that caused something known as a blue screen of death. It just errors out the operating system: it goes to an error state, which presents a blue screen to the end user.

And so a simple reboot did not resolve it. It would just bring back the blue screen over and over again, so you were in an endless loop. Now, CrowdStrike identified this error relatively quickly, which was very impressive on their part, and within an hour they had a patch available.

They removed that update and needed to roll to an update that didn't create the conflict. The challenge therein lies with, okay, now you've got all of these global machines, somewhere to the tune of about 8.5 million Windows devices. Now, that's a very small portion of the world's Windows devices, but still a very large population.

They all had to be manually touched. From a healthcare perspective, say there are a thousand endpoints within an organization and you need to physically touch each one of those and go through a couple of boot cycles: boot up, remove the driver, update it, reboot it again.

It's a very manual process to recover from. Yeah, so you're talking 20 to 30 minutes probably per machine for the first cycle, and no way to automate that necessarily, right? You physically need to put hands on keyboard and go through that cycle. So it's very resource intensive.
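
To put that manual effort in rough numbers, here's a minimal back-of-the-envelope sketch in Python. Apart from the 20-to-30-minute-per-machine estimate above, the endpoint count and technician headcount are illustrative assumptions, not figures from the episode.

    # Rough estimate of hands-on remediation effort for a BSOD-style outage.
    # All inputs are illustrative assumptions, not figures from the episode.

    def remediation_hours(endpoints: int, minutes_per_machine: float, technicians: int) -> float:
        """Approximate wall-clock hours to manually touch every affected endpoint."""
        total_minutes = endpoints * minutes_per_machine
        return total_minutes / technicians / 60

    if __name__ == "__main__":
        # e.g., 1,000 Windows endpoints, ~25 minutes each, 10 people working in parallel
        hours = remediation_hours(1_000, 25, 10)
        print(f"{hours:.1f} hours of parallel, hands-on-keyboard work")  # ~41.7 hours

Even with ten people working in parallel, a thousand endpoints at roughly 25 minutes each works out to about 42 hours of continuous hands-on work.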

That was a really interesting part of it too, right?

I read a lot of stuff, and I hate to say it, but a lot of it came from people who don't know what they're talking about, people asking things like, why couldn't they just roll back? Why wasn't there just a rollback plan? What do you say to people who have made that assertion?

There were some instances, like BitLocker as an example: those that had BitLocker running on their machines could roll back to a previously known good state. So there are situations where certain organizations that had other technologies deployed did have it a little easier, and their recovery was a little faster.

Even then, though, didn't you have to go touch the machine? You had to actually do something to the machine. That was the thing that I think really killed everyone: I have to physically go touch every device.

Yeah. And CrowdStrike really worked hard to try to be very transparent and communicative around it, and also to let other people know that this wasn't a security event.

Their platforms are still monitoring, but you just had to get back to that known good state, which means it was very resource intensive and very time intensive. And most organizations, especially healthcare organizations, don't have that kind of workforce to be able to go out and touch all of them.

So, here we are four days later, and many organizations are still dealing with it. I had a friend traveling through the Denver airport, and all the screens were still blue in the Denver airport this morning. So, as an example, this was a global impact, affecting not only healthcare organizations but banking, finance, transportation, and 911 as well.

So, on the healthcare side, what are you hearing out there? I've heard from a lot of my friends, a lot of members of our community, telling me about their situation, but what are you hearing?

Same, it's mixed. Some, obviously, that were running different EDR platforms, SentinelOne, Cybereason, Microsoft Defender, weren't impacted by it.

Those that are running CrowdStrike are still dealing with the recovery. Several of my specific clients had six, seven, eight hundred, a thousand, fifteen hundred machines they had to touch; most of them were down to under three or four hundred machines by end of day Friday.

So they were recovering well, but the demand they had to put in was significant. The other unique thing that I think most people didn't consider was that those that weren't even operating on CrowdStrike, that were operating on another EDR platform, were still seeing secondary impacts. Dragon Speak, as an example, was using CrowdStrike and was significantly impacted.

We saw payroll implications. We saw other processing implications coming from third and fourth parties, where the organization itself wasn't impacted directly through its endpoints, but critical business services were delivered by third-party service providers, those service providers were unable to service their customers, and we saw secondary impacts like that.

Yeah, in the spirit of, and I think you and I talked about this in the last episode, that everything-is-connected-to-everything-else idea, we do live in this really strange world now. I don't know if it's strange, but it's just the world we've created, where one domino gets tipped over, and even though you're completely clean in your organization, you're connected to partners who wind up having problems, which obviously affects whether you can deliver care or not.

Yeah, 100%. There are a number of news articles talking about Cleveland Clinic, Mayo Clinic, Barnabas, and so many others that were all impacted, and elective, non-critical services were all shut down, right?

They just couldn't take it, because they either couldn't process patients into the ED, or they couldn't handle some of the ambulatory visits, and they were still busy rebooting a lot of their systems or getting access restored. For inpatient care, I think the downtime procedures were sufficient to keep operations going, but ingesting new patients or taking on non-critical, non-essential care was all slowed or stopped.

So, I noticed that you were quoted in HIPAA Journal. I think I reposted that article today, a good one. Talk a little bit about what you said there.

Yeah, they picked up a number of things, playing on the connected-world aspect of that, as well as the business continuity aspects of it: having documented, trained, and tested backup and business continuity plans, right?

Having your downtime procedures documented, having them tested and having your individuals trained on how to deploy them, right? And this is a perfect instance where an unexpected update caused a global outage and forced many of the healthcare organizations into downtime procedures, and so they had to go manual in some cases.

They had to rely on manual turnover reports, because maybe they weren't able to use their floor mobile carts to get to the EMR system, and so they had to have manual turnover sheets. Everything had to be documented. Obviously, they had to scale up the nurses and things like that to make sure they had proper patient coverage. These downtime procedures are critical for healthcare.

I feel like, if we've learned nothing else in the last year, between Change, between the CrowdStrike thing, in the spirit of everything's connected to everything else, and other big outages that affected other partners, it's this whole idea of business continuity, business continuity planning, and running those exercises on business continuity. And the reality, I think too, that you and I talk about regularly, is that the three-hour downtime plan is not the same thing as the three-day or 30-day downtime plan. These are different things; you can't just take the three-hour downtime plan and keep doing that for 30 days.

So you guys do a lot of work helping folks figure this out, right?

Yeah, we do. We've seen a significant increase in inquiries around third-party business impact analysis: looking at critical business services provided by third parties, and what recovery time and recovery point objectives are associated with them.

So we're using the old business continuity and disaster recovery process to say, let's do some business impact analysis, or business impact assessments, to understand those implications, but putting the critical third-party business services into that. Traditionally, it's around your core services, your collaborations, and your core business processes. We've now expanded that into looking at the third parties that are providing services, because we are a hybrid environment today, where we're using more cloud-based services and more software-as-a-service providers, and we need to blend them into that aspect. And we're really looking at the downtime procedures, to your point: when were they last tested, how long can you operate on them, and what does that look like over minutes, days, weeks, and potentially even months on some of these major ransom cases where you see outages for an extended period of time.
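
As a rough illustration of the third-party angle on a business impact analysis, here's a minimal Python sketch of a register that tracks recovery time and recovery point objectives per vendor-delivered service and flags outages that blow past the agreed RTO. The service names, RTO/RPO values, and the 72-hour outage figure are hypothetical examples, not Fortified's actual methodology.

    # Minimal sketch of a third-party business impact register.
    # Service names, objectives, and downtimes are hypothetical examples.
    from dataclasses import dataclass

    @dataclass
    class ThirdPartyService:
        name: str
        business_function: str
        rto_hours: float  # recovery time objective agreed with the vendor
        rpo_hours: float  # recovery point objective (tolerable data-loss window)

    def breaches_rto(service: ThirdPartyService, observed_downtime_hours: float) -> bool:
        """Flag services whose observed outage exceeded the agreed RTO."""
        return observed_downtime_hours > service.rto_hours

    services = [
        ThirdPartyService("speech-recognition SaaS", "clinical dictation", rto_hours=4, rpo_hours=1),
        ThirdPartyService("claims clearinghouse", "revenue cycle", rto_hours=24, rpo_hours=4),
    ]

    for svc in services:
        if breaches_rto(svc, observed_downtime_hours=72):
            print(f"{svc.name}: outage exceeded the {svc.rto_hours}h RTO, escalate in the BIA review")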

Yeah. So, that's one lesson, and I'm sure we're going to get tons of these as this continues to unfold. What other lessons are you already looking at or thinking about, coming out of this, that you're coaching and teaching customers and the community on?

Yeah, definitely. Just looking at the broader program scope as a whole, looking at the ecosystem, understanding the implications, because we do live in this interconnected world. It's not just the four walls of your own organization anymore; it really is that hybrid environment, looking beyond, looking out into your cloud providers.

So we've seen an increase in third-party risk management, we've seen an increase in that business continuity and disaster recovery work in the business, but then also just stepping back, rethinking your entire program, and rethinking the threats and risks that target your organization. We're seeing lots of executive inquiries.

Hey, how are we protected from things like Change and Ascension and now CrowdStrike? The big takeaway, and I don't know if you picked up on the news article or the release that the CEO from SentinelOne sent out: SentinelOne deploys to a very small percentage of their ecosystem, like one to two percent, when they do a push, to see if there are any implications. They don't have a global outage; they can roll back, and the impact is very small. I would imagine CrowdStrike is going to be picking up a similar deployment strategy going forward, I would think.
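
For context on what a staged rollout looks like in practice, here's a minimal Python sketch of a ring-based (canary) deployment with a health check and an early halt. The ring sizes, health check, and function names are assumptions for illustration only; they don't represent SentinelOne's or CrowdStrike's actual deployment mechanics.

    # Minimal sketch of a ring-based ("canary") content-update rollout.
    # Ring sizes and the health check are illustrative assumptions.
    import random

    RINGS = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet updated at each stage

    def ring_is_healthy() -> bool:
        """Stand-in health check; in practice, watch crash/BSOD telemetry from the ring."""
        return random.random() > 0.001  # assume the update is almost always fine

    def staged_rollout(fleet_size: int) -> None:
        deployed = 0
        for fraction in RINGS:
            target = int(fleet_size * fraction)
            print(f"Updating {target - deployed:,} more endpoints ({fraction:.0%} of fleet)")
            deployed = target
            if not ring_is_healthy():
                print("Telemetry shows faults: halt the rollout and roll back this ring")
                return
        print("Update fully deployed")

    staged_rollout(fleet_size=8_500_000)

The point is blast radius: a bad update caught in the one-percent ring touches tens of thousands of machines instead of millions.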

I think there's so much stuff that's going to come out of the root cause analysis. You talked about how they've done a really good job of communicating. I think they've been incredibly transparent and very apologetic. My feeling generally is, okay, this is good.

Keep behaving like that.

Hi everyone, I'm Sarah Richardson, president of the 229 Executive Development Community at This Week Health. I'm thrilled to share some exciting news with you. I'm launching a new show on our conference channel called Flourish. In Flourish, we dive into captivating career origin stories, offering insights and inspiration to help you thrive in your own career journey.

Whether you're a health system employee in IT or a partner looking to understand the healthcare landscape better, Flourish has something valuable for you. It's all about gaining perspectives and finding motivation to flourish in your career.

You can tune in on ThisWeekHealth.com or wherever you listen to podcasts. Stay curious, stay inspired, and keep flourishing. I can't wait for you to join us on this journey.

I think as they get through the RCA and start to understand the bloody details about what happened, that's going to be critical too: the transparency around everything they see, everything they find, and the changes they're going to make to keep this from ever happening again.

Customers and everyone's going to be really interested in hearing the rest of that story.

Yeah, the big question in the back of my mind is, why would they do a global push to every Windows agent out there running? Why would they do such a large push, knowing the risk if they got it wrong?

Don't get me wrong. CrowdStrike has been a very stable platform for many years, so this is an anomaly. Maybe somebody didn't follow proper procedure here or something. Hopefully they're transparent enough to share that with us so that we can use those lessons learned.

Yeah, in our own organizations and our own practices.

Yeah, one of the other lessons in all of this: when I do the Two Minute Drill, one of the last things I always say is, stay a little paranoid. And I think the paranoia in this case was probably appropriately placed, because it hardly took any time until the bad guys were going down the path of phishing, calling and identifying themselves as CrowdStrike help desk guys, trying to take advantage of people in a time of frenzy and emergency.

They're real jerks, man, but they really figure out how to do this stuff.

We did see that. We did start seeing what's known as vishing, or voice phishing, where people were dialing in, acting as CrowdStrike, trying to use that opportunity of chaos to slide in and either gain a foothold, gain access, or get credentials reset by offering up advice and assistance. We immediately produced a threat bulletin utilizing CrowdStrike-specific content.

There were a ton of people giving all sorts of recommendations and advice. I generally follow the rule of thumb that CrowdStrike's the expert. It's their system, they understand the behind-the-scenes stuff, and it's best to follow their procedures and be safe about it.

So we helped share that broadly with all of our ecosystem, got the threat bulletins out there, and tried to help educate on what we know. Early-day statements, as we all know, can be misleading at times, because everyone is very quick to get communication out there, but no different than we did with Change, we stayed in sync with what the company was communicating.

Same here: follow CrowdStrike's lead and be careful around other people, because there were a ton of conspiracy theories that started around people rebooting their machines and other malware sliding in during the reboots, things like that, where I think people were trying to use fear, uncertainty, and doubt to slip things in.

But it's a lesson learned. There's a lot going on here, but there's no indication that there was any other malicious activity tied to any of this at this time.

There's also just the conflation: when it's a cybersecurity company that's having the problem, your brain can't help but put these things together. Oh, it must be a cybersecurity incident. And in fact, this wasn't really a cybersecurity incident. This was a major worldwide outage incident that was caused by a cybersecurity company, but there wasn't an adversary involved, based on everything we know at this point.

And similarly, Change Healthcare was a cyber event, right? But the secondary impact was a worldwide outage of not being able to process claims, which created a revenue recognition problem for most organizations. It wasn't a cyber event for everyone else. But this one obviously had no cyber threat actor involved.

It was self-induced. Yeah.

There's a part of this where my empathy bone also aches a little bit, because having been a CIO for 30 years, I've probably had a couple of those little situations too, where I've had a self-inflicted wound and had to figure out how to get myself out of it.

So that kind of stuff happens. The last thing I definitely want to talk about is just the superheroes in the organizations who put on the capes again and went out and saved healthcare.

Yeah. There are so, so many, right? All weekend long, we were exchanging with CISOs and CIOs, just staying abreast of the monumental lift that had to be undertaken to go out and put hands on keyboards on all those machines, get them rebooted, and get them restored.

Obviously, asset inventory is critical. Prioritization of which systems to go reboot first, right? This one's critical: those mobile carts on every floor, so that you can at least get your nurses back connected and get your doctors, physicians, and clinicians reconnected to your EMR, are one of those critical elements of it.

And then you start bringing up your critical services: ED, ambulatory, and so on and so forth. Get your radiology and oncology back up and operational. So, understanding those priorities matters. This was a real-world test, not only for the superheroes who had to go out and respond to it, but also of, do I have a solid asset inventory?

Do I know where all these devices are located in the physical infrastructure, let alone how to put hands on the keyboards of those that were remote?

Yeah, I think the point too about prioritizing, having some kind of prioritization around those endpoints: if I've got a thousand machines that I need to go touch and reboot, in what order?

Which ones help me keep patients and families safe? We go there first, but do we know that? Have we sat down, talked about it, thought about it, and created a plan around this-is-where-we-go-first, this-is-where-we-go-later? I think a lot of folks probably did that in an ad hoc way and were really successful, but anything you can think about and plan in advance is probably good.
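
As a sketch of what a pre-built prioritization plan could look like, here's a minimal Python example that ranks an asset inventory by a criticality tier so recovery teams know where to go first. The tiers, hostnames, and locations are hypothetical, not drawn from any real inventory.

    # Minimal sketch of pre-ranking an asset inventory for recovery order.
    # Tiers, hostnames, and locations are hypothetical examples.

    PRIORITY = {"patient-safety": 0, "clinical-ops": 1, "business": 2, "back-office": 3}

    endpoints = [
        {"host": "ED-REG-01",    "tier": "patient-safety", "location": "Emergency Dept"},
        {"host": "FLOOR3-WOW-7", "tier": "clinical-ops",   "location": "3 West mobile cart"},
        {"host": "HR-LAPTOP-42", "tier": "back-office",    "location": "Admin building"},
        {"host": "RADONC-WS-2",  "tier": "clinical-ops",   "location": "Radiation Oncology"},
    ]

    # Work the list in patient-safety-first order.
    for ep in sorted(endpoints, key=lambda e: PRIORITY[e["tier"]]):
        print(f'{ep["host"]:<14} {ep["tier"]:<15} {ep["location"]}')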

Yeah. Most people will hopefully document this as a real test of their business continuity and disaster recovery plans, take the lessons learned, update those plans, and figure out where they can strengthen them. Turn a negative into a positive by using it as a valid test and a valid instance, and then let's improve upon them and be better prepared, because preparedness is the only thing we have on our side to be ready for the next one.

And unfortunately, there will be more to come. Realize that we are mid-year, right? This is July, and we've already started to see the uptick. The Fourth of July seems to be the new start. It used to be Thanksgiving time when we would see the threat actors increase, as we enter the U.S. holiday cycle. Last year was the first year I'd seen them start around the Fourth of July, right? It happened again this year: we saw an increase in threat actor activity focusing on the Fourth of July and Independence Day celebrations. And so threat actors have got a longer window now, July to well into January and February.

So it doesn't leave many months now that we're not under attack. We just released the Mid-Year Horizon Report here at Fortified, which gives a lot of the stats. And although the stats today, if you're following OCR, don't indicate that this year's reported breaches were necessarily worse than last year's, they don't include the details from Change or Ascension yet, because those are still being ratified.

So, I'm fully predicting the year-end numbers will look worse once those are counted.

But great strides are still being made every single day, and more and more senior executives and boards are getting into these conversations. I'm being asked to go out and speak to those boards and educate them on what more they can do in their programs, which is all part of us at least starting to have the right conversations about what we need to do to correct the problem.

Hey, thanks for your time today. I really appreciate you being here. We'll make sure we put a link to the Horizon Report when we publish this on Unhack the News, so it's there if folks are interested in picking it up and reading it. Yeah, thanks again.

I really appreciate you being part of it.

Absolutely.  

Thanks for tuning in to Unhack the News. And while this show keeps you updated on the biggest stories, we also try to provide some context and even opinions on the latest developments. And now there's another way for you to stay ahead. Subscribe to our Daily Insights email. What you'll get is expertly curated health IT news straight to your inbox, ensuring you never miss a beat.

Sign up at thisweekhealth.com/news. I'm your host, Drex DeFord. Thanks for spending some time with me today. And that's it for Unhack the News.

As always, stay a little paranoid, and I'll see you around campus.
