The page has been translated by Gen AI.

Availability Goal

Availability refers to the ability to reliably provide functions and performance that meet users’ expectations when they attempt to use IT resources or services.

When designing or building IT services in an on‑premises environment, first review the service’s availability requirements, then analyze each component’s failure factors to design and implement either a single configuration or redundancy/multiplicity solutions.

Identify High-Availability Implementation Targets

Task importance assessment

To identify tasks for high-availability implementation, we assess task importance based on the following criteria.

Evaluation criteria	Content
Business impact	Estimated revenue loss in case of system downtime Costs due to customer churn or service contract violations
Customer Service Impact	Impact of System Outage on Customer Experience Potential for Customer Complaints or Claims Alternative Service Provision Plan in Case of Customer Service Disruption
Compliance	Regulatory compliance obligations Potential legal issues if violated

Evaluation criteria

Content

Business impact

Estimated revenue loss in case of system downtime

Costs due to customer churn or service contract violations

Customer Service Impact

Impact of System Outage on Customer Experience

Potential for Customer Complaints or Claims

Alternative Service Provision Plan in Case of Customer Service Disruption

Compliance

Regulatory compliance obligations

Potential legal issues if violated

Table. Work importance assessment

We sometimes use impact analysis to evaluate task importance.

Business Impact Analysis (Business Impact Analysis) is covered in more detail in the reliability design principles.

High Availability Requirements and Countermeasures

To achieve high availability, you must be able to establish appropriate failure mitigation measures based on the components or causes of failures.

The table below shows the high-availability implementation requirements for failure or disaster scenarios and examples of how to implement them in a cloud environment.

Category	Requirement example	Implementation plan example
Cloud data center outage or disaster	Design an architecture that does not interrupt service even in the event of a physical data center failure of the cloud service provider	Multi-AZ-based redundant configuration
Cloud service outage	Design a high-availability architecture that can achieve fault tolerance even in the event of a single server (VM) failure Configure a high-availability database server	Active-Active or Active-Standby high-availability implementation
Demand surge	Design a cloud architecture that dynamically adjusts resources in response to changing demand and elastically handles load. Use a managed service that inherently provides load‑handling capabilities, so the user does not have to design or manage load response directly.	Auto-Scaling or managed service implementation
Attack/Intrusion	Design an architecture that minimizes service downtime even during DDoS attacks	DDoS Protection configuration Auto-Scaling configuration

Table. Example of requirements and implementation measures by disability type

Service Level Metrics and Goal Setting

When designing for availability, applying high availability to every component can involve various constraints such as budget and operational staff.

When specifying availability requirements for critical systems, customers typically demand a “high availability” or “redundancy” level.

However, the implication of this customer requirement is not merely that components should be duplicated; it also means that the system must be designed so that the entire service does not stop even if a single component fails.

When analyzing high-availability requirements, it is necessary to derive objective metrics that enable systematic management of the customer’s comprehensive high-availability demands, and to set goals based on them.

The key management items and metrics related to this can be found in the table below.

Item	Content
Service Level Agreement(SLA) Service Level Agreement	An official contract between the service provider and the customer that defines the expected level of the provided service The SLA includes specific service level objectives such as availability, performance, and support response time It also specifies the conditions under which the service provider must compensate the customer if these objectives are not met
Service Level Objective(SLO) Service Level Objective	refers to a specific service level objective defined within the SLA An SLO is a measurable objective, for example, expressed as “maintain service availability of at least 99.9%" An SLO is a criterion that the service provider must meet, serving as a detailed target to satisfy the SLA
Service Level Indicator(SLI) Service Level Indicator	Specific metrics used to measure SLO SLI expresses service performance or availability numerically, for example, it can be represented as “average response time over the past 30 days” SLI provides essential data for evaluating whether the SLO is met
Mean Time to Detection(MTTD) average detection time	Average time between failure occurrence and start of repair work
Mean Time to Repair(MTTR) average recovery time	From the time a device or system failure occurs to the point when repair at the failure location is completed and operation can resume, the average time MTTR = (F1 + F2 …Fn)/n [F: downtime, n: number of failures]
Mean Time To Failure(MTTF) Mean time between failures	Average fault-free operating time from repair completion to the next failure MTTF = (T1 + T2…Tn)/n [T: operating time, n: number of failures]

Table. Service goals and metrics

The relationship among MTTD, MTTR, and MTTF can be understood from the diagram below.

Concept diagram — Figure. Availability metric

Earlier, we expressed the availability metric, availability rate, as 1 - {sum of downtime (minutes)}/{sum of total service usage time (minutes)}.

If we express availability using the above metric, it can be represented as MTTF/(MTTF + MTTR).

To increase availability, minimizing MTTR, i.e., the mean time to recovery, is essential.

MTTR consists of MTTD and recovery time, where MTTD refers to the time elapsed from a failure occurring to its detection and the start of recovery actions.

High availability means that MTTD and recovery actions are performed automatically without human intervention, and this is a primary review item of this availability design principle.

To improve availability, you must configure monitoring, alerts, and response automation to detect failures and enable automatic remediation, thereby minimizing MTTD.

Additionally, by automating redundant resource configuration and resource scaling, we can automate incident response, minimize MTTR, and improve availability.

Samsung Cloud Platform Availability

In cloud environments, we design and implement either a single configuration or redundancy/multi-redundancy according to service availability requirements. However, failure analysis is performed based on the availability specified in the Service Level Agreement (SLA) provided by the cloud provider.

In Samsung Cloud Platform, the SLA provides the monthly availability rate (%) for each service.

The table below shows the monthly availability calculation formulas and term definitions for each service provided by Samsung Cloud Platform.

There are services that calculate availability based on uptime, and services that calculate availability based on the number of requests or incidents.

service	Disability definition	Monthly availability (%)	Definition of downtime
Common*	If the customer’s running instance or individual services cannot secure external connections and access.	[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100	Total outage duration for the month caused by the company (SDS)
Virtual Server DR	If, after completing a simulation exercise or disaster failover in the DR Recovery Plan, customer service access is unavailable or the requested failover does not occur.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total downtime for the month caused by the company (SDS)
DBaaS, Event Streams, Search Engine	If all running multi‑instances or individual services fail to maintain external connections and access for more than 5 minutes.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total outage time for the month caused by the company (SDS)
Kubernetes Engine	If external connections and access to the control plane cannot be secured for more than 5 minutes.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total outage duration for the month caused by the company (SDS)
Container Registry	When all connection requests to the Container Registry Endpoint fail for 5 minutes.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total outage duration for the month caused by the company (SDS)
DDoS Protection, Secured VPN, Secured Firewall, WAF, IPS, SASE	If the security equipment supplied by the company malfunctions or fails to operate, leading to a failure to monitor and detect customer services.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total downtime for the month caused by the company (SDS)
File Storage, Block Storage	If a timeout of more than 15 seconds occurs from Compute to Storage.	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total downtime for the month caused by the company (SDS)
Transit Gateway, Direct Connect	Port is down, or network errors in the connection segment prevent information transmission and reception for more than 120 seconds.	[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100	Total outage duration for the month caused by the company (SDS)
Edge Server	Customer service failure due to inability to access Edge Manager or Edge Server outage	[1 - {sum of outage time (minutes) / sum of total service usage time per month (minutes)}] * 100	Total downtime for the month due to the company’s (SDS) reasons
Private Cloud	When a customer’s running instance or individual service fails to obtain CMP access.	[1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100	Total downtime for the month caused by the company (SDS)
Cloud Functions	Return a 500 or 503 error code for the request (excluding custom errors)	[1 - (sum of outage time and count)/(sum of total service usage month time and count)] * 100	Sum of failure duration and incident count for each 5‑minute period
Object Storage, Archive Storage	Return 500 error for Storage request Failure rate: rate of failure requests Average failure rate: monthly average failure rate	100% - average failure rate	-
API Gateway	Return a 500 or 503 error code for the request	[1 - (total failures)/(total service usage months)] * 100	Total number of error requests per 5‑minute interval
Quick Query	When an SQL query request fails to secure an external connection and access for more than 5 minutes.	[1 - (sum of failure counts)/(sum of total service usage month counts)] * 100	Total number of incidents that occurred in the month due to company reasons
Backup	Customer-requested backup failure (exclude if re-execution succeeds)	[1 - (sum of failure counts)/(sum of total service usage month counts)] * 100	Total number of incidents that occurred in the month due to company reasons

*Common representative services include Virtual Server, Bare Metal Server, Load Balancer, GSLB, VPN, SingleID, DevOps Service, AIOS, etc.

Table. Service availability

Samsung Cloud Platform’s Well-Architected design principles provide design guidelines on availability (Availability) and reliability (Reliability) from the perspective of fault tolerance and resilience.

The availability design principle focuses on designing the architecture around automated response measures to ensure that the service does not become unavailable even in the event of a component failure or an unexpected surge in load.

The reliability design principle focuses on minimizing data loss and enabling rapid service recovery in the event of unplanned failures or disasters.

The core of availability design lies in implementing high availability (High Availability) by deploying services redundantly so that, even if a component fails, other components continue to operate normally, preventing service interruption.

Additionally, securing scalability (Scalability) to prevent bottlenecks in specific components caused by increased demand, which can delay or interrupt service responses, is also an important goal of availability design.