The page has been translated by Gen AI.

Availability Goal

Availability Goal

Availability refers to the ability to reliably provide functions and performance that meet users’ expectations when they attempt to use IT resources or services.

When designing or building IT services in an on‑premises environment, first review the service’s availability requirements, then analyze each component’s failure factors to design and implement either a single configuration or redundancy/multiplicity solutions.

Identify High-Availability Implementation Targets

Task importance assessment

To identify tasks for high-availability implementation, we assess task importance based on the following criteria.

Evaluation criteriaContent
Business impact
  • Estimated revenue loss in case of system downtime
  • Costs due to customer churn or service contract violations
Customer Service Impact
  • Impact of System Outage on Customer Experience
  • Potential for Customer Complaints or Claims
  • Alternative Service Provision Plan in Case of Customer Service Disruption
Compliance
  • Regulatory compliance obligations
  • Potential legal issues if violated
Table. Work importance assessment

We sometimes use impact analysis to evaluate task importance.

Business Impact Analysis (Business Impact Analysis) is covered in more detail in the reliability design principles.

High Availability Requirements and Countermeasures

To achieve high availability, you must be able to establish appropriate failure mitigation measures based on the components or causes of failures.

The table below shows the high-availability implementation requirements for failure or disaster scenarios and examples of how to implement them in a cloud environment.

CategoryRequirement exampleImplementation plan example
Cloud data center outage or disaster
  • Design an architecture that does not interrupt service even in the event of a physical data center failure of the cloud service provider
Multi-AZ-based redundant configuration
Cloud service outage
  • Design a high-availability architecture that can achieve fault tolerance even in the event of a single server (VM) failure
  • Configure a high-availability database server
Active-Active or
Active-Standby high-availability implementation
Demand surge
  • Design a cloud architecture that dynamically adjusts resources in response to changing demand and elastically handles load.
  • Use a managed service that inherently provides load‑handling capabilities, so the user does not have to design or manage load response directly.
Auto-Scaling or
managed service implementation
Attack/Intrusion
  • Design an architecture that minimizes service downtime even during DDoS attacks
DDoS Protection configuration
Auto-Scaling configuration
Table. Example of requirements and implementation measures by disability type

Service Level Metrics and Goal Setting

When designing for availability, applying high availability to every component can involve various constraints such as budget and operational staff.

When specifying availability requirements for critical systems, customers typically demand a “high availability” or “redundancy” level.

However, the implication of this customer requirement is not merely that components should be duplicated; it also means that the system must be designed so that the entire service does not stop even if a single component fails.

When analyzing high-availability requirements, it is necessary to derive objective metrics that enable systematic management of the customer’s comprehensive high-availability demands, and to set goals based on them.

The key management items and metrics related to this can be found in the table below.

ItemContent
Service Level Agreement(SLA)
Service Level Agreement
  • An official contract between the service provider and the customer that defines the expected level of the provided service
  • The SLA includes specific service level objectives such as availability, performance, and support response time
  • It also specifies the conditions under which the service provider must compensate the customer if these objectives are not met
Service Level Objective(SLO)
Service Level Objective
  • refers to a specific service level objective defined within the SLA
  • An SLO is a measurable objective,
    for example, expressed as “maintain service availability of at least 99.9%"
  • An SLO is a criterion that the service provider must meet, serving as a detailed target to satisfy the SLA
Service Level Indicator(SLI)
Service Level Indicator
  • Specific metrics used to measure SLO
  • SLI expresses service performance or availability numerically,
    for example, it can be represented as “average response time over the past 30 days”
  • SLI provides essential data for evaluating whether the SLO is met
Mean Time to Detection(MTTD)
average detection time
  • Average time between failure occurrence and start of repair work
Mean Time to Repair(MTTR)
average recovery time
  • From the time a device or system failure occurs
    to the point when repair at the failure location is completed and operation can resume, the average time
  • MTTR = (F1 + F2 …Fn)/n [F: downtime, n: number of failures]
Mean Time To Failure(MTTF)
Mean time between failures
  • Average fault-free operating time from repair completion to the next failure
  • MTTF = (T1 + T2…Tn)/n [T: operating time, n: number of failures]
Table. Service goals and metrics

The relationship among MTTD, MTTR, and MTTF can be understood from the diagram below.

Concept diagram
Figure. Availability metric

Earlier, we expressed the availability metric, availability rate, as 1 - {sum of downtime (minutes)}/{sum of total service usage time (minutes)}.

If we express availability using the above metric, it can be represented as MTTF/(MTTF + MTTR).

To increase availability, minimizing MTTR, i.e., the mean time to recovery, is essential.

MTTR consists of MTTD and recovery time, where MTTD refers to the time elapsed from a failure occurring to its detection and the start of recovery actions.

High availability means that MTTD and recovery actions are performed automatically without human intervention, and this is a primary review item of this availability design principle.

To improve availability, you must configure monitoring, alerts, and response automation to detect failures and enable automatic remediation, thereby minimizing MTTD.

Additionally, by automating redundant resource configuration and resource scaling, we can automate incident response, minimize MTTR, and improve availability.

Samsung Cloud Platform Availability

In cloud environments, we design and implement either a single configuration or redundancy/multi-redundancy according to service availability requirements. However, failure analysis is performed based on the availability specified in the Service Level Agreement (SLA) provided by the cloud provider.

In Samsung Cloud Platform, the SLA provides the monthly availability rate (%) for each service.

The table below shows the monthly availability calculation formulas and term definitions for each service provided by Samsung Cloud Platform.

There are services that calculate availability based on uptime, and services that calculate availability based on the number of requests or incidents.

serviceDisability definitionMonthly availability (%)Definition of downtime
Common*If the customer’s running instance or individual services cannot secure external connections and access.[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100Total outage duration for the month caused by the company (SDS)
Virtual Server DRIf, after completing a simulation exercise or disaster failover in the DR Recovery Plan, customer service access is unavailable or the requested failover does not occur.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total downtime for the month caused by the company (SDS)
DBaaS, Event Streams, Search EngineIf all running multi‑instances or individual services fail to maintain external connections and access for more than 5 minutes.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total outage time for the month caused by the company (SDS)
Kubernetes EngineIf external connections and access to the control plane cannot be secured for more than 5 minutes.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total outage duration for the month caused by the company (SDS)
Container RegistryWhen all connection requests to the Container Registry Endpoint fail for 5 minutes.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total outage duration for the month caused by the company (SDS)
DDoS Protection, Secured VPN, Secured Firewall, WAF, IPS, SASEIf the security equipment supplied by the company malfunctions or fails to operate, leading to a failure to monitor and detect customer services.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total downtime for the month caused by the company (SDS)
File Storage, Block StorageIf a timeout of more than 15 seconds occurs from Compute to Storage.[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total downtime for the month caused by the company (SDS)
Transit Gateway, Direct ConnectPort is down, or network errors in the connection segment prevent information transmission and reception for more than 120 seconds.[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100Total outage duration for the month caused by the company (SDS)
Edge ServerCustomer service failure due to inability to access Edge Manager or Edge Server outage[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100Total downtime for the month due to the company’s (SDS) reasons
Private CloudWhen a customer’s running instance or individual service fails to obtain CMP access.[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100Total downtime for the month caused by the company (SDS)
Cloud FunctionsReturn a 500 or 503 error code for the request (excluding custom errors)[1 - (sum of outage time and count)/(sum of total service usage month time and count)] * 100Sum of failure duration and incident count for each 5‑minute period
Object Storage, Archive StorageReturn 500 error for Storage request
Failure rate: rate of failure requests
Average failure rate: monthly average failure rate
100% - average failure rate-
API GatewayReturn a 500 or 503 error code for the request[1 - (total failures)/(total service usage months)] * 100Total number of error requests per 5‑minute interval
Quick QueryWhen an SQL query request fails to secure an external connection and access for more than 5 minutes.[1 - (sum of failure counts)/(sum of total service usage month counts)] * 100Total number of incidents that occurred in the month due to company reasons
BackupCustomer-requested backup failure (exclude if re-execution succeeds)[1 - (sum of failure counts)/(sum of total service usage month counts)] * 100Total number of incidents that occurred in the month due to company reasons

*Common representative services include Virtual Server, Bare Metal Server, Load Balancer, GSLB, VPN, SingleID, DevOps Service, AIOS, etc.

Table. Service availability

Samsung Cloud Platform’s Well-Architected design principles provide design guidelines on availability (Availability) and reliability (Reliability) from the perspective of fault tolerance and resilience.

The availability design principle focuses on designing the architecture around automated response measures to ensure that the service does not become unavailable even in the event of a component failure or an unexpected surge in load.

The reliability design principle focuses on minimizing data loss and enabling rapid service recovery in the event of unplanned failures or disasters.

The core of availability design lies in implementing high availability (High Availability) by deploying services redundantly so that, even if a component fails, other components continue to operate normally, preventing service interruption.

Additionally, securing scalability (Scalability) to prevent bottlenecks in specific components caused by increased demand, which can delay or interrupt service responses, is also an important goal of availability design.