CROWDSTRIKE & BSOD – EXPERIENCE REPORT
Events such as the CrowdStrike incident, a faulty security patch or even the failure of a cloud service can suddenly turn into a real disaster for your company. That is why disaster recovery processes, reliable systems and well-trained employees are so important.
We spoke to an employee of an affected customer to find out how he experienced this situation and what conclusions he drew from it.
59s: How did you find out that your IT infrastructure was down?
Customer: I was called directly on Friday morning at around 07:30 and told that all clients and servers were showing a Blue Screen Of Death (BSOD). Without any further information, the first assumption was a cyber attack or faulty Microsoft patches, although we never roll out patches without testing them first. On site, it turned out that the BSOD on the clients and servers named the file csagent.sys as the cause of the error. Each time a device was restarted, the system crashed again shortly after the login screen appeared.
59s: How did you coordinate the troubleshooting internally?
Customer: Our management started a crisis meeting, while we in IT could start working on the problem undisturbed. Some of the Windows 10/11 clients were unaffected because they had been switched off at the time or had still been able to download the corrected file from CrowdStrike.
The service desk informed the first callers; later, an announcement was made so that the service desk would not be overwhelmed.
At this point, there was no information or confirmation from CrowdStrike to support our suspicion that the CrowdStrike agent had caused the BSOD.
At first we thought we had fallen victim to a cyber attack, but we quickly found out on subreddits (https://www.reddit.com) such as “r/crowdstrike” and “r/sysadmin” that other customers were affected by the same problem. We found it interesting that some of the posts on “r/crowdstrike” were later deleted.
At 08:02 we tried to contact CrowdStrike; after what felt like an eternity, we only received a response at 08:51. By then we had already seen the first instructions on how to fix the problem on Reddit, at 08:39.
First we fixed our domain controllers. The trainees were fully involved and took care of the remaining servers, while I took care of the Ivanti EPM Core and the PXE servers, because without the core and a preferred server on site, no fix could be distributed.
While my colleagues continued to fix the servers by hand, I worked on an OSD (operating system deployment) template to get our clients productive again.
59s: What was the specific challenge in rectifying the problem?
Customer: Solving the problem required several manual steps directly on each computer: restarting, booting into the recovery environment and deleting the faulty “C-00000291*.sys” file. Our own security measures made this particularly time-consuming: all of our approx. 2,600 clients are encrypted with BitLocker, so each hard disk first had to be unlocked with its BitLocker recovery key.
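To make the manual workaround more concrete, here is a minimal Python sketch of the unlock-and-delete sequence that had to happen on every client. It is purely illustrative: it assumes a recovery environment in which Python, manage-bde and wpeutil are available, and the recovery key shown is a placeholder; in practice, these steps were carried out by hand in the Windows recovery console.

```python
# Illustrative sketch of the manual per-client workaround. Assumes a recovery
# environment with Python, manage-bde and wpeutil available; in practice the
# steps were carried out by hand in the Windows recovery console.
import glob
import os
import subprocess

SYSTEM_DRIVE = "C:"
RECOVERY_KEY = "111111-222222-333333-444444-555555-666666-777777-888888"  # placeholder

# 1. Unlock the BitLocker-encrypted system drive with its 48-digit recovery key.
subprocess.run(
    ["manage-bde", "-unlock", SYSTEM_DRIVE, "-RecoveryPassword", RECOVERY_KEY],
    check=True,
)

# 2. Delete the faulty CrowdStrike channel file(s) that trigger the BSOD.
for path in glob.glob(SYSTEM_DRIVE + r"\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"):
    os.remove(path)

# 3. Reboot; the client should then start normally again.
subprocess.run(["wpeutil", "reboot"], check=False)
```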
59s: How did Ivanti help you solve the problem?
Customer: For the servers, the fix was quite simple. I used Ivanti Endpoint Manager (EPM) to create an OSD task that searches each hard disk for the corresponding file and deletes it (roughly along the lines of the sketch below). The server was then restarted automatically and worked again. This enabled us to fix the remaining approx. 200 of our 400 servers in a short time.
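As a rough approximation only, the logic of such a task looks something like this; the real fix was an Ivanti EPM OSD task, not a Python script.

```python
# Rough, illustrative approximation of the server fix: scan every drive letter
# for the faulty CrowdStrike channel file, delete it and restart automatically.
import glob
import os
import string
import subprocess

for letter in string.ascii_uppercase:
    pattern = rf"{letter}:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys"
    for path in glob.glob(pattern):
        os.remove(path)

# Restart so the server comes back up without manual intervention.
subprocess.run(["shutdown", "/r", "/t", "0"], check=False)
```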
It was more difficult with the clients due to BitLocker. At least we were able to automatically boot the computers into a WinPE via the EPM. From there, the admins were able to unlock the hard disks and delete the file. This allowed us to interrupt the endless loop of reboots on the devices and saved the admins time. What also saved us is that the Ivanti Agent collects the recovery key. We still use Sophos to administer the BitLocker keys, but some clients had no recovery key stored in Sophos for about 3% of the time. Without the recovery key in the Ivanti EPM, we would have lost all data on the affected systems.
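A gap like that is easy to audit after the fact. Here is a hedged sketch, assuming hypothetical CSV exports (the file names and the "hostname" column are made up) of the client inventory and of the hosts for which Sophos and the EPM hold a recovery key:

```python
# Hypothetical audit: which BitLocker clients have no recovery key stored anywhere?
# The CSV files and their "hostname" column are made-up examples of exports from
# the client inventory, Sophos and the Ivanti EPM.
import csv

def hosts(path: str) -> set[str]:
    with open(path, newline="", encoding="utf-8") as f:
        return {row["hostname"].strip().lower() for row in csv.DictReader(f)}

all_clients = hosts("all_clients.csv")
sophos_keys = hosts("sophos_recovery_keys.csv")
epm_keys = hosts("epm_recovery_keys.csv")

print("No key in Sophos:", len(all_clients - sophos_keys))
print("No key anywhere:", len(all_clients - sophos_keys - epm_keys))
```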
59s: How quickly were you able to fix the problem?
Customer: We had 99% of our global servers back up and running by 12:00 noon. Just in time for lunch. That was a precision landing, as our checkout system in the canteen was also affected 🙂.
It took us about 3 hours from the time we became aware of the workaround until the servers were fixed. All users who had been working on Friday were then able to work again.
On the following Monday, Microsoft released a WinPE-based recovery tool for clients which, once the BitLocker recovery key has been entered, automatically deletes the faulty file and restarts the machine. I was then able to pass this on to our admins.
59s: What is your conclusion?
Customer: Without Ivanti, it would have been a long day or a short weekend. It helped us make the workaround available across the board much more quickly. And the fact that the EPM also reads out the BitLocker keys unexpectedly saved us on some computers.
In the future, we will replace Sophos, store the BitLocker keys in AD and use the five(9)s console to retrieve the recovery keys quickly and easily.
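When BitLocker keys are escrowed to Active Directory, as planned here, they are stored as msFVE-RecoveryInformation child objects beneath the computer account. Below is a minimal sketch of reading one out with the Python ldap3 library; the domain controller, credentials, base DN and hostname are placeholders, and the querying account needs permission to read the confidential msFVE-RecoveryPassword attribute.

```python
# Minimal sketch: read a client's BitLocker recovery password from Active Directory.
# Server, credentials, base DN and hostname below are placeholders.
from ldap3 import Server, Connection, SUBTREE

conn = Connection(
    Server("ldaps://dc01.example.local"),
    user="EXAMPLE\\svc_bitlocker_read",
    password="***",
    auto_bind=True,
)

# 1. Find the computer object of the affected client.
conn.search("DC=example,DC=local",
            "(&(objectClass=computer)(name=PC-1234))",
            attributes=["name"])
computer_dn = conn.entries[0].entry_dn

# 2. The escrowed keys are msFVE-RecoveryInformation child objects of that computer.
conn.search(
    computer_dn,
    "(objectClass=msFVE-RecoveryInformation)",
    search_scope=SUBTREE,
    attributes=["msFVE-RecoveryPassword", "whenCreated"],
)
for entry in conn.entries:
    print(entry.whenCreated, entry["msFVE-RecoveryPassword"])
```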
Our security team will soon be having a review meeting with CrowdStrike to discuss the specific consequences.