Massive Outage Highlights Need for a New Generation of Resilient Operating Systems


A massive outage affected multiple businesses worldwide due to a routine application update, highlighting the critical need for a new generation of resilient operating systems like SUSE Linux Enterprise Micro in large-scale deployments.

Today, we woke up to a massive outage affecting multiple businesses worldwide. Airlines, banks, credit card companies, and other industries were impacted. Many wonder how a simple maintenance update in one of hundreds of applications left many systems unable to boot, causing business outages and chaos. The manual and local fixes required for each affected device mean it could take weeks to recover fully. This incident, involving a widespread issue with Windows and CrowdStrike, highlights the critical need for a more resilient operating system for large-scale deployments.

The Root Cause

The root cause of this disruption was a routine application update that caused systems to fail with a Blue Screen of Death (BSOD) on every reboot. The affected devices included desktops, servers, terminals, and edge devices. Because these devices are widely dispersed and each one needs a manual fix, recovery becomes much harder. Such incidents expose the vulnerabilities inherent in traditional operating systems when managing extensive IT infrastructures.

The Scale Problem


As we scale, we must be concerned about how something as routine as a maintenance update failure can dramatically affect our business continuity and potentially our brand and reputation. What makes it worse is that this will not be the last maintenance error. The only question is when the next one will occur. Application errors, administrator mistakes, and other issues can always happen.

In the past, when everything ran on central systems with manual procedures, administrators could access machines directly, and resolution was not too time-consuming. They could walk up to the systems and repair them within hours. Today, environments are dispersed across the edge, cloud, and remote sites, and administration is done through automation and software. In that setting, unbootable systems could take weeks of work, and potentially travel, to recover.

Learnings

This is why we need a new generation of operating systems designed to be always ready to service: operating systems that always boot into a ready-to-service state, with automated health checks and rollback capabilities that make recovery time negligible and allow administrators to perform repairs remotely if needed.

The Need for an Immutable, Transactional, Enterprise-Ready OS Supporting Full Rollback

Although some general-purpose Linux distributions, like SUSE Linux Enterprise Server, offer excellent resiliency and rollback capabilities backed by the Btrfs filesystem, they can’t cover every case in which an error makes the system unusable. To have an OS that is always ready to boot and ready to service, we need an immutable, transactional operating system like SUSE Linux Enterprise Micro (SLE Micro). Unlike traditional systems, SLE Micro offers automated health checks and rollback capabilities, so that a maintenance error can be undone automatically. The system is left booted and ready to service, and the underlying problem can then be corrected centrally without manual intervention on each device. This design keeps devices able to boot consistently, reducing downtime and maintenance costs.

Benefits in Large-Scale Environments

In large-scale environments, where the complexity of IT management is magnified, the benefits of an OS like SLE Micro become evident. Its transactional updates mean that changes are automatically rolled back if they don’t pass all predefined health checks, preventing incomplete or faulty updates from affecting system operations. In the event of an error, the OS can seamlessly roll back to a previous stable state, ensuring continuity of service.
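To make the idea concrete, here is a minimal toy model of a transactional update with health checks and automatic rollback. It is only a sketch of the concept, not how SLE Micro is implemented: the `Host` class, the `transactional_update` function, and the package names and versions are all hypothetical, and a plain dictionary stands in for a filesystem snapshot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for a filesystem snapshot: package name -> version.
Snapshot = Dict[str, str]
HealthCheck = Callable[[Snapshot], bool]


@dataclass
class Host:
    snapshots: List[Snapshot]
    active: int = 0  # index of the snapshot the host boots into

    def current(self) -> Snapshot:
        return self.snapshots[self.active]


def transactional_update(host: Host, changes: Snapshot,
                         checks: List[HealthCheck]) -> bool:
    """Apply changes to a new snapshot; keep it only if every check passes."""
    previous = host.active
    # The running snapshot is never modified: changes go into a copy.
    candidate = {**host.current(), **changes}
    host.snapshots.append(candidate)
    host.active = len(host.snapshots) - 1  # simulate booting the new snapshot

    if all(check(candidate) for check in checks):
        return True

    # Automatic rollback: boot the last known-good snapshot again.
    host.active = previous
    return False


# Hypothetical health check and package versions, for illustration only.
def agent_is_healthy(snapshot: Snapshot) -> bool:
    return snapshot.get("security-agent") != "2.0-broken"


host = Host(snapshots=[{"kernel": "6.4", "security-agent": "1.9"}])
ok = transactional_update(host, {"security-agent": "2.0-broken"},
                          [agent_is_healthy])
print(ok, host.current())
# prints: False {'kernel': '6.4', 'security-agent': '1.9'}
```

The point of the model is that the faulty update never touches the snapshot the host was running, so a failed health check only has to switch back to it. The host stays bootable, and the bad change can be investigated later.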

Infrastructure Ready for the Edge

We have learned that the edge is a very specific scenario where incidents like these can multiply the impact. Therefore, edge infrastructure must be equipped with solutions designed to handle such situations. SUSE Edge leverages SUSE Linux Enterprise Micro to provide a robust solution. SUSE Edge ensures that dispersed and remote systems are always in a ready-to-service state, offering automated health checks and rollback capabilities. This makes managing and recovering edge devices efficient and reliable, significantly reducing the risk and impact of system failures. Learn more about SUSE Edge and its capabilities here.

An Additional Learning: The Need to Implement Processes for Patching

To further minimize risks, it’s crucial to implement processes that test patches in staging environments before deploying them to production. This applies not only to OS patches but to updates for every application. Tools like SUSE Manager can facilitate and automate this process by managing patch testing and staging in preproduction environments, helping ensure updates are reliable and reducing the likelihood of system failures.
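As a rough illustration of such a process, the sketch below promotes a patch through test, staging, and production, and stops at the first stage where any host fails verification. It does not use SUSE Manager's API: `staged_rollout`, `fake_apply`, the stage names, and the host names are all hypothetical, and the verification step is simulated.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, List[str]]  # (stage name, hosts in that stage)


def staged_rollout(patch: str, stages: List[Stage],
                   apply_and_verify: Callable[[str, str], bool]) -> Dict[str, str]:
    """Apply a patch stage by stage; stop before later stages on any failure."""
    results: Dict[str, str] = {}
    for stage_name, hosts in stages:
        # Every host in the current stage is patched and verified.
        failed = [h for h in hosts if not apply_and_verify(h, patch)]
        if failed:
            results[stage_name] = "failed on " + ", ".join(failed)
            break  # later stages, including production, are never touched
        results[stage_name] = "ok"
    return results


# Simulated verification: the hypothetical "bad-patch" breaks staging hosts.
def fake_apply(host: str, patch: str) -> bool:
    return not (patch == "bad-patch" and host.startswith("stage"))


stages = [
    ("test", ["test-01"]),
    ("staging", ["stage-01", "stage-02"]),
    ("production", ["prod-01", "prod-02", "prod-03"]),
]

report = staged_rollout("bad-patch", stages, fake_apply)
print(report)
# prints: {'test': 'ok', 'staging': 'failed on stage-01, stage-02'}
```

Because the rollout halts in staging, the production hosts never receive the faulty patch, which is exactly the protection a preproduction gate is meant to provide.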

Conclusion

The recent outage is a stark reminder of the risks associated with conventional operating systems in managing extensive and remote IT estates. By adopting an always-ready-to-service OS like SUSE Linux Enterprise Micro, organizations can mitigate such risks, ensuring a more resilient and manageable IT environment.

Explore the capabilities of SLE Micro in my previous blog on optimizing software delivery with SUSE Linux Micro 6.0 here.

 

Sebastian Martinez
Sebastian Martinez has 25+ years of experience in the tech industry and enjoys searching for creative solutions and staying up to date with technology trends.