High availability (HA) is the ability of a computer system or IT component to operate continuously for a specified length of time. The term may also refer to an agreed level of operational performance (usually uptime) that is assured for a higher-than-normal period. Availability is often measured against a 100% operational, or never-fails, standard. A common standard of availability is known as “five 9s,” or 99.999% availability, which allows roughly 5.26 minutes of downtime per year. By comparison, “two 9s” describes a system that guarantees 99% availability over a one-year period, allowing up to 1% downtime, or 3.65 days of unavailability. Service level agreements (SLAs) often use monthly downtime or availability percentages to calculate billing.
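The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `allowed_downtime_minutes` and the choice of a 365-day year are assumptions made for the example.

```python
# Downtime permitted by an availability target over a given period.
# Assumption: a year is treated as 365 days (no leap-year adjustment).
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_percent: float, period_days: float = 365) -> float:
    """Return the maximum downtime, in minutes, that still meets the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * MINUTES_PER_DAY


# Print the yearly downtime budget for a few common targets.
for label, target in [("two 9s", 99.0), ("three 9s", 99.9), ("five 9s", 99.999)]:
    minutes = allowed_downtime_minutes(target)
    days = minutes / MINUTES_PER_DAY
    print(f"{label} ({target}%): {minutes:.2f} minutes per year ({days:.4f} days)")
```

Passing a different `period_days` gives the budget for other windows. For example, `allowed_downtime_minutes(99.9, period_days=30)` gives the allowance for a 30-day month, which is the kind of figure an SLA with monthly billing might use.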
The increased demand for reliable infrastructure running business-critical systems has made reducing downtime and eliminating single points of failure just as important as high availability itself. For example, hospitals and data centers require high availability of their systems, with no unscheduled downtime, to perform daily tasks. Unscheduled downtime may result from a hardware or software failure, or from adverse environmental conditions such as power outages, flooding or temperature changes. Scheduled downtime for system updates and maintenance is often not included in availability percentages.
Reliability engineering uses three principles of systems design to help achieve high availability: elimination of single points of failure; reliable crossover or failover points; and failure detection capabilities. High availability of data access and storage is often required in government, healthcare and other compliance-regulated industries. Highly available systems must recover from server or component failure automatically. A distributed approach can achieve this with multiple redundant nodes connected as a cluster, where each node is capable of failure detection and recovery. SUSE Enterprise Storage is an example of a highly available system designed to have no single points of failure.
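To make the ideas of failure detection and failover concrete, here is a minimal Python sketch based on heartbeat timestamps. It is an illustration under stated assumptions, not how SUSE Enterprise Storage or any particular product works: the `Node` and `Cluster` classes, the `timeout` value and the "first healthy node takes over" rule are all hypothetical. It also uses a single observer for simplicity, whereas a real cluster would have every node run detection and would need quorum or fencing to avoid two nodes becoming active at once.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) the last heartbeat was received


@dataclass
class Cluster:
    nodes: List[Node]
    timeout: float = 5.0  # hypothetical: seconds of silence before a node counts as failed
    active: Optional[str] = None  # name of the node currently serving requests

    def healthy_nodes(self, now: float) -> List[Node]:
        """Failure detection: a node is healthy if its heartbeat is recent enough."""
        return [n for n in self.nodes if now - n.last_heartbeat <= self.timeout]

    def elect_active(self, now: float) -> Optional[str]:
        """Failover: keep the active node if healthy, else promote the first healthy one."""
        healthy_names = [n.name for n in self.healthy_nodes(now)]
        if self.active in healthy_names:
            return self.active
        self.active = healthy_names[0] if healthy_names else None
        return self.active


cluster = Cluster(nodes=[Node("node-a", 0.0), Node("node-b", 0.0), Node("node-c", 0.0)])
print(cluster.elect_active(now=1.0))  # all nodes healthy, node-a becomes active

# node-a goes silent while node-b and node-c keep sending heartbeats.
cluster.nodes[1].last_heartbeat = 9.0
cluster.nodes[2].last_heartbeat = 9.0
print(cluster.elect_active(now=10.0))  # node-a has timed out, node-b takes over
```

Because the cluster holds redundant nodes, the loss of node-a does not interrupt service: the next healthy node is promoted automatically. Timestamps are passed in explicitly rather than read from the system clock so that the behavior is easy to follow and reproduce.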