The page has been translated by Gen AI.

Establishing Availability Goals

Availability refers to the ability to reliably provide functions and performance that meet a user’s expectations when they intend to use IT resources or services.

When designing or building IT services in an on‑premises environment, we first review the service’s availability requirements, then analyze the failure factors of each component and design and implement either a single (non-redundant) configuration or a redundant (duplexed or multiplexed) configuration.

Identify high availability implementation targets

Work Importance Assessment

To identify tasks for high availability implementation, we evaluate task importance based on the following items.

Evaluation Item	Content
Business Impact	- Expected revenue loss when system is down - Costs due to customer churn or service contract violations
Customer Service Impact	- Impact of system downtime on customer experience - Possibility of customer complaints or claims - Alternative service provision plan when customer service is interrupted
Regulatory Compliance	- Regulatory compliance obligations - Whether legal issues arise in case of a violation

Table. Work Importance Assessment

We sometimes use business impact analysis to assess the importance of work.

Business Impact Analysis is covered in more detail in the reliability design principles.

High Availability Requirements and Countermeasures

To implement high availability, you must be able to establish appropriate fault handling measures based on the components or causes of failures.

The table below shows the high-availability implementation requirements for fault or disaster situations, and examples of how to implement them in a cloud environment.

Category	Example of requirements	Example of implementation plan
Cloud data center failure or disaster	- Design an architecture that does not interrupt service even if the physical data center of the cloud service provider fails.	Multi-AZ based redundant configuration
Cloud Service Failure	- Design a high-availability architecture that can achieve fault tolerance even in the event of a single server (VM) failure. - Configure a high-availability database server.	Active-Active or Active-Standby high-availability implementation
Demand Surge	- Design a cloud architecture that dynamically adjusts resources in response to changing demand and elastically handles load. - Use a managed service that inherently provides load handling capabilities without the user having to design or manage load handling themselves.	Auto-Scaling or Managed Service Implementation
Attack/Intrusion	- Design an architecture that can minimize service downtime even under DDoS attacks.	- DDoS Protection configuration - Auto-Scaling configuration

Table. Example of requirements and implementation plans by failure type
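The Demand Surge row can be illustrated with a minimal scaling sketch. It assumes a simple proportional rule (scale the instance count by the ratio of observed load to target load), which is a common target-tracking approach and not a description of how Samsung Cloud Platform Auto-Scaling behaves. The function name, parameters, and bounds are hypothetical.

```python
import math


def desired_instances(current: int, observed_load: float, target_load: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Proportional target-tracking rule (an assumption for illustration)."""
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    # Scale the instance count by how far the observed load is from the target.
    raw = math.ceil(current * observed_load / target_load)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_instances, min(max_instances, raw))


# Hypothetical example: 4 instances at 90% load with a 60% target.
print(desired_instances(4, observed_load=90, target_load=60))
```

The upper bound matters for the Attack/Intrusion row as well: without it, attack traffic could drive scaling (and cost) without limit.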

Service Level Indicators and Goal Establishment

When designing for availability, applying high availability to every component may not be feasible because of constraints such as budget or operations staffing.

When customers present availability requirements for critical systems, they generally demand a level of “high availability” or “redundancy”.

However, the meaning implied by these customer requirements is not merely that components should be duplicated, but that the system must be designed so that even if a single component fails, the entire service is not interrupted.

When analyzing high availability requirements, it is necessary to derive objective metrics so that the customer’s comprehensive high availability demands can be systematically managed, and to set goals based on them.

The main management items and indicators related to this can be found in the table below.

Item	Content
SLA (Service Level Agreement)	- An official contract between the service provider and the customer that defines the expected level of the provided service. - An SLA includes specific service level objectives such as availability, performance, and support response time. - It also specifies the conditions under which the service provider must compensate the customer if these objectives are not met.
SLO (Service Level Objective)	- A specific service level objective specified within the SLA. - An SLO is a measurable target, expressed in forms such as “maintain service availability of at least 99.9%”. - An SLO is a criterion that the service provider must comply with: a detailed goal for meeting the SLA.
SLI (Service Level Indicator)	- A specific metric used to measure an SLO. - An SLI expresses service performance or availability numerically; for example, “average response time over the past 30 days”. - SLIs provide the data needed to assess whether an SLO is achieved.
MTTD (Mean Time to Detection)	- Average time from failure occurrence to the start of repair work.
MTTR (Mean Time to Repair)	- Average time from the point when a device or system failure occurs to the point when the repair is completed and operation becomes possible again. - MTTR = (F1 + F2 + … + Fn) / n [F: duration of each failure, n: number of failures]
MTTF (Mean Time To Failure)	- Average time of failure-free operation from the completion of a repair to the next failure. - MTTF = (T1 + T2 + … + Tn) / n [T: operating time, n: number of failures]

Table. Service Goals and Metrics
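To make the SLO and SLI rows concrete, the sketch below converts an availability SLO into the downtime it allows over a period, and checks a measured SLI against the SLO. The function names and the 30-day period are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Downtime budget implied by an availability SLO over `days` days."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def slo_met(sli_percent: float, slo_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_percent


# A 99.9% SLO over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(99.9), 1))
print(slo_met(sli_percent=99.95, slo_percent=99.9))
```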

The relationship among MTTD, MTTR, and MTTF is shown in the figure below.

Figure. Availability metric
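The same relationship can be sketched in code. The example below computes MTTD, MTTR, and MTTF from a list of incidents, following the definitions in the table above. The `Incident` record, its field names, and the sample timestamps are hypothetical; times are minutes since the start of the observation window.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    occurred: float        # when the failure occurred
    repair_started: float  # when it was detected and repair work began
    restored: float        # when repair completed and operation resumed


def failure_metrics(incidents: list[Incident], start: float = 0.0):
    """Return (MTTD, MTTR, MTTF) for incidents sorted by time; needs >= 1 incident."""
    mttd = mean(i.repair_started - i.occurred for i in incidents)
    mttr = mean(i.restored - i.occurred for i in incidents)
    # Operating time T: from the previous restoration (or window start) to the failure.
    uptimes, previous_restored = [], start
    for i in incidents:
        uptimes.append(i.occurred - previous_restored)
        previous_restored = i.restored
    return mttd, mttr, mean(uptimes)


# Two hypothetical incidents.
incidents = [Incident(100, 110, 130), Incident(430, 435, 450)]
print(failure_metrics(incidents))
```

In this sample, each MTTR interval contains its MTTD interval, which is why shortening detection directly shortens recovery.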

Conventionally, availability is expressed as the availability rate: 1 - {sum of downtime (minutes)} / {sum of total service usage time (minutes)}.

Expressed in terms of the metrics above, availability is MTTF / (MTTF + MTTR).
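The two expressions agree when the observation window consists of whole failure cycles: over n cycles the total time is n × (MTTF + MTTR) and the downtime is n × MTTR. The sketch below checks this with hypothetical numbers; the function names are illustrative.

```python
import math


def availability_from_means(mttf: float, mttr: float) -> float:
    """Availability as MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def availability_from_downtime(downtime: float, total_time: float) -> float:
    """Availability as 1 - downtime / total time."""
    return 1 - downtime / total_time


# Hypothetical values: 2 failure cycles with MTTF = 200 and MTTR = 25 minutes.
n, mttf, mttr = 2, 200.0, 25.0
by_means = availability_from_means(mttf, mttr)
by_downtime = availability_from_downtime(n * mttr, n * (mttf + mttr))
print(math.isclose(by_means, by_downtime))
```

Because MTTR appears only in the denominator, halving MTTR raises availability without any change in how often failures occur.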

To increase availability, the key is to minimize MTTR, the mean time to repair.

MTTR consists of MTTD and recovery time. MTTD is the time from when a failure occurs until it is detected and recovery actions begin.

High availability means that failure detection and recovery actions are performed automatically, without human intervention; this is a major review target of the availability design principles.

To improve availability, you need to configure monitoring, alerts, and response automation to detect failures and enable automatic actions, thereby minimizing MTTD.
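As a minimal sketch of the detection side, the example below assumes periodic health probes and raises an alert after a fixed number of consecutive failed probes. The function name and the threshold are hypothetical, and this is not a Samsung Cloud Platform monitoring API.

```python
from typing import Iterable, Optional


def first_alert_index(probe_results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the probe at which an alert fires, or None.

    Each element is True for a healthy probe and False for a failed one.
    """
    if threshold < 1:
        raise ValueError("threshold must be >= 1")
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        # A healthy probe resets the streak; a failed probe extends it.
        consecutive_failures = 0 if healthy else consecutive_failures + 1
        if consecutive_failures >= threshold:
            return index
    return None


# The failure starts at probe 1; the alert fires at probe 3.
print(first_alert_index([True, False, False, False, True]))
```

A shorter probe interval or a lower threshold shortens detection time, at the cost of more false alarms from transient errors.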

Additionally, by automating resource redundancy configuration and resource scaling, we can automate fault response, minimize MTTR, and improve availability.

Samsung Cloud Platform Availability

In cloud environments, too, we design and implement either single configurations or redundant (duplexed or multiplexed) configurations according to the service’s availability requirements. Failure factor analysis, however, is based on the availability rate in the Service Level Agreement (SLA) presented by the cloud provider.
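One way to use provider availability rates in such an analysis is to combine them per architecture. The sketch below assumes that component failures are independent, which real systems only approximate. The availability values are hypothetical and are not Samsung Cloud Platform SLA figures.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """All components are required: the availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availability: float, copies: int) -> float:
    """Service is up if at least one of `copies` identical components is up."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: a single 99% server versus two redundant ones,
# each placed behind a 99.9% load balancer.
single = serial_availability([0.999, 0.99])
duplexed = serial_availability([0.999, redundant_availability(0.99, 2)])
print(f"single: {single:.4%}, duplexed: {duplexed:.4%}")
```

Under these assumptions, redundancy removes most of the server's contribution to downtime, and the non-redundant component in the chain becomes the limiting factor.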

Samsung Cloud Platform provides monthly service availability rates (%) through SLA.

The table below shows the monthly availability calculation formulas and term definitions for each service provided by Samsung Cloud Platform.

Some services calculate availability based on operating time; others calculate it based on the number of requests or incidents.

Service	Failure Definition	Monthly Availability (%)	Failure Time Definition
Common*	When the customer’s running instance or individual service cannot secure external connection and access	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Virtual Server DR	When, after a mock drill or disaster transition in the DR Recovery Plan is completed, customer service access is unavailable or the requested transition is not performed	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
DBaaS, Event Streams, Search Engine	When all running multi-instances or individual services cannot secure external connection and access for more than 5 minutes	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Kubernetes Engine	When external connection and access to the control plane cannot be secured for more than 5 minutes	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Container Registry	When all connection requests to the Container Registry Endpoint fail for 5 minutes	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
DDoS Protection, Secured VPN, Secured Firewall, WAF, IPS, SASE	When monitoring and detection for customer services fail because security equipment provided by the company is not operating or is malfunctioning	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
File Storage, Block Storage	When a timeout of 15 seconds or more occurs from Compute to Storage	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Transit Gateway, Direct Connect	When a port is down, or information cannot be transmitted or received for more than 120 seconds due to network errors in the connection segment	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Edge Server	When the Edge Manager connection is unavailable or a customer service failure occurs due to Edge Server malfunction	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Private Cloud	When all of the customer’s running instances or individual services cannot secure CMP access	[1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100	Sum of downtime that occurred in the month for reasons attributable to the company (SDS)
Cloud Functions	When a 500 or 503 error code is returned for a request (excluding custom errors)	[1 - (sum of outage time and count) / (sum of total service usage time and count for the month)] * 100	Sum of outage time and count, measured per 5 minutes
Object Storage, Archive Storage	When a 500 error is returned for a storage request. Failure rate: ratio of failed requests. Average failure rate: monthly average of the failure rate.	100% - average failure rate	-
API Gateway	When a 500 or 503 error code is returned for a request	[1 - (sum of failure counts) / (sum of total service usage counts for the month)] * 100	Sum of the counts of requests that encountered errors, measured every 5 minutes
Quick Query	When an SQL query request cannot secure external connection and access for more than 5 minutes	[1 - (sum of incident counts) / (total service usage count for the month)] * 100	Total number of incidents that occurred in the month for reasons attributable to the company
Backup	When a customer-requested backup fails (excluding cases where re-execution succeeds)	[1 - (sum of incident counts) / (total service usage count for the month)] * 100	Total number of incidents that occurred in the month for reasons attributable to the company

*Typical common services include Virtual Server, Bare Metal Server, Load Balancer, GSLB, VPN, SingleID, DevOps Service, and AIOS.

Table. Service Availability Rate
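The two families of formulas in the table can be sketched as follows. The function names and the sample figures are hypothetical; the arithmetic follows the time-based rows and the count-based API Gateway row.

```python
def time_based_availability(downtime_minutes: float, usage_minutes: float) -> float:
    """[1 - {sum of downtime / total service usage time for the month}] * 100"""
    if usage_minutes <= 0:
        raise ValueError("usage_minutes must be positive")
    return (1 - downtime_minutes / usage_minutes) * 100


def count_based_availability(failed_requests: int, total_requests: int) -> float:
    """[1 - (sum of failure counts) / (sum of total usage counts)] * 100"""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1 - failed_requests / total_requests) * 100


# Hypothetical month: 30 days of usage with 43.2 minutes of downtime,
# and 5 failed requests out of 10,000.
print(f"{time_based_availability(43.2, 30 * 24 * 60):.2f}%")
print(f"{count_based_availability(5, 10_000):.2f}%")
```

Only downtime or failures attributable to the provider count toward these formulas, so the inputs should be filtered accordingly before the calculation.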

Samsung Cloud Platform’s Well-Architected design principles cover failures and resilience under two headings: availability and reliability.

The availability design principles focus on building the architecture around automated response measures, so that service is not interrupted even when a component fails or load spikes unexpectedly.

The reliability design principles focus on minimizing data loss and enabling rapid service recovery in the event of unplanned failures or disasters.

The core of availability design is to implement high availability by deploying services redundantly, so that even if one component fails, the others continue to operate normally and service is not interrupted.

Another important goal of availability design is to ensure scalability, so that increased demand does not create bottlenecks in specific components that delay or interrupt service responses.