Establishing Availability Goals
Availability refers to the ability to reliably provide functions and performance that meet a user’s expectations when they intend to use IT resources or services.
When designing or building IT services in an on‑premises environment, we first review the service’s availability requirements, then analyze the failure factors of each component, and finally design and implement either a single-instance configuration or a redundant (duplicated or multiplexed) one.
Identify high availability implementation targets
Workload Importance Assessment
To identify the workloads that need high availability, evaluate the importance of each workload against the following items.
| Evaluation Item | Content |
|---|---|
| Business Impact | - Expected revenue loss when system is down - Costs due to customer churn or service contract violations |
| Customer Service Impact | - Impact of system downtime on customer experience - Possibility of customer complaints or claims - Alternative service provision plan when customer service is interrupted |
| Regulatory Compliance | - Regulatory compliance obligations - Whether legal issues arise upon violation |
Business impact analysis (BIA) can also be used to assess workload importance.
It is covered in more detail in the reliability design principles.
High Availability Requirements and Countermeasures
To implement high availability, you must be able to establish appropriate fault handling measures based on the components or causes of failures.
The table below shows the high-availability implementation requirements for fault or disaster situations, and examples of how to implement them in a cloud environment.
| Category | Example of requirements | Example of implementation plan |
|---|---|---|
| Cloud data center failure or disaster | - Design an architecture that does not interrupt service even if the physical data center of the cloud service provider fails. | Multi-AZ based redundant configuration |
| Cloud Service Failure | - Design a high-availability architecture that can achieve fault tolerance even in the event of a single server (VM) failure. - Configure a high-availability database server. | Active-Active or Active-Standby high-availability implementation |
| Demand Surge | - Design a cloud architecture that dynamically adjusts resources in response to changing demand and elastically handles load. - Use a managed service that inherently provides load handling capabilities without the user having to design or manage load handling themselves. | Auto-Scaling or Managed Service Implementation |
| Attack/Intrusion | - Design an architecture that can minimize service downtime even for DDoS attacks. | - DDoS Protection configuration - Auto-Scaling configuration |
Service Level Indicators and Goal Establishment
When designing availability, applying high availability to all components may be subject to various constraints such as budget or operational staff.
When customers present availability requirements for critical systems, they generally demand a level of “high availability” or “redundancy”.
However, the meaning implied by these customer requirements is not merely that components should be duplicated, but that the system must be designed so that even if a single component fails, the entire service is not interrupted.
When analyzing high availability requirements, it is necessary to derive objective metrics so that the customer’s comprehensive high availability demands can be systematically managed, and to set goals based on them.
The main management items and indicators related to this can be found in the table below.
| Item | Content |
|---|---|
| SLA (Service Level Agreement) | - A formal contract between the service provider and the customer that defines the expected level of the provided service. - An SLA includes specific service level objectives such as availability, performance, and support response time. - It also specifies the conditions under which the service provider must compensate the customer if these objectives are not met. |
| SLO (Service Level Objective) | - A specific service level objective specified within the SLA. - An SLO is a measurable target, expressed in forms such as “maintain service availability of at least 99.9%”. - An SLO is a criterion that the service provider must comply with, a detailed goal for meeting the SLA. |
| SLI (Service Level Indicator) | - A specific metric used to measure an SLO. - An SLI expresses service performance or availability numerically, for example as “average response time over the past 30 days”. - SLIs provide the data needed to assess whether an SLO is achieved. |
| MTTD (Mean Time To Detection) | - Average time from the occurrence of a failure to the start of repair work. |
| MTTR (Mean Time To Repair) | - Average time from the point when a device or system failure occurs to the point when the repair is completed and operation can resume. - MTTR = (F1 + F2 + … + Fn) / n [F: duration of each failure, n: number of failures] |
| MTTF (Mean Time To Failure) | - Average time of failure-free operation, measured from the completion of a repair to the next failure. - MTTF = (T1 + T2 + … + Tn) / n [T: operating time, n: number of failures] |
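To make an SLO such as “at least 99.9% availability” more tangible, it can help to convert it into a monthly downtime budget. The following is a minimal sketch, not part of any SLA: the function name `allowed_downtime_minutes` and the 30-day month are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Return the downtime (in minutes) per month that an availability SLO permits.

    Hypothetical helper: a 30-day month is assumed unless stated otherwise.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = days_in_month * 24 * 60
    # The budget is the share of the month that is allowed to be unavailable.
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"SLO {slo}% -> {allowed_downtime_minutes(slo):.2f} minutes/month")
```

Under these assumptions, a 99.9% SLO leaves roughly 43 minutes of downtime per 30-day month, which shows how little room a manual recovery process has.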
The relationship between MTTD, MTTR, and MTTF is shown in the figure below.
Traditionally, the availability rate has been expressed as 1 - {sum of downtime (minutes)}/{sum of total service usage time (minutes)}.
Expressed in terms of the metrics above, availability equals MTTF/(MTTF + MTTR).
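The MTTR, MTTF, and availability formulas can be worked through with a small example. This is a sketch under stated assumptions: the function names and the sample durations (in hours) are hypothetical and only serve to illustrate the arithmetic.

```python
from statistics import mean
from typing import Sequence


def mttr(failure_durations: Sequence[float]) -> float:
    """Mean Time To Repair: (F1 + F2 + ... + Fn) / n."""
    return mean(failure_durations)


def mttf(operating_times: Sequence[float]) -> float:
    """Mean Time To Failure: (T1 + T2 + ... + Tn) / n."""
    return mean(operating_times)


def availability(mttf_value: float, mttr_value: float) -> float:
    """Availability as a fraction between 0 and 1: MTTF / (MTTF + MTTR)."""
    return mttf_value / (mttf_value + mttr_value)


if __name__ == "__main__":
    # Hypothetical sample data: three failures, all values in hours.
    operating_hours = [700, 720, 740]  # failure-free operation before each failure
    failure_hours = [1, 2, 3]          # time needed to recover from each failure

    a = availability(mttf(operating_hours), mttr(failure_hours))
    print(f"MTTF={mttf(operating_hours)}h MTTR={mttr(failure_hours)}h availability={a:.4%}")
```

Because MTTR appears only in the denominator, halving MTTR raises availability without any change to how often failures occur.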
The key to increasing availability is minimizing MTTR, the mean time to repair.
MTTR consists of MTTD and the recovery time; MTTD is the time from when a failure occurs until it is detected and recovery actions begin.
High availability means that failure detection and recovery actions are performed automatically, without human intervention, and this is a major review target of the availability design principle.
To improve availability, you need to configure monitoring, alerts, and response automation to detect failures and enable automatic actions, thereby minimizing MTTD.
Additionally, by automating resource redundancy configuration and resource scaling, we can automate fault response, minimize MTTR, and improve availability.
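The idea of detecting a failure and triggering an automatic action can be sketched as a health-check loop. This is an illustrative sketch only, not a Samsung Cloud Platform API: `detect_and_recover`, the probe results, and the `recover` callback are all hypothetical names, and a real setup would rely on the platform's monitoring and alerting services.

```python
from typing import Callable, Iterable, Optional


def detect_and_recover(
    probe_results: Iterable[bool],
    failure_threshold: int,
    recover: Callable[[], None],
) -> Optional[int]:
    """Trigger `recover` once `failure_threshold` consecutive probes have failed.

    Returns the index of the probe that triggered recovery, or None if the
    threshold was never reached. `probe_results` stands in for periodic
    health checks (True = healthy, False = failed).
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    consecutive_failures = 0
    for index, healthy in enumerate(probe_results):
        if healthy:
            consecutive_failures = 0  # a healthy probe resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            recover()  # automated action, e.g. failover or instance replacement
            return index
    return None


if __name__ == "__main__":
    actions = []
    triggered_at = detect_and_recover(
        [True, False, False, False, True],
        failure_threshold=3,
        recover=lambda: actions.append("failover"),
    )
    print(triggered_at, actions)
```

In this sketch the detection delay is roughly the probe interval multiplied by the failure threshold, so shorter intervals or lower thresholds reduce MTTD at the cost of more false alarms.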
Samsung Cloud Platform Availability
In cloud environments, we likewise design and implement single-instance or redundant configurations according to service availability requirements. However, failure factor analysis is based on the availability rate in the Service Level Agreement (SLA) published by the cloud provider.
Samsung Cloud Platform provides monthly service availability rates (%) through SLA.
The table below shows the monthly availability calculation formulas and term definitions for each service provided by Samsung Cloud Platform.
Some services calculate availability based on operating time, while others calculate it based on the number of requests or the number of incidents.
| Service | Failure Definition | Monthly Availability (%) | Failure Time Definition |
|---|---|---|---|
| Common* | When the operating customer’s instance or individual service fails to secure external connections and access | [1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100 | Sum of downtime that occurred in the month due to the company’s (SDS) reasons |
| Virtual Server DR | If, after the mock training or disaster transition in the DR Recovery Plan is completed, customer service access is unavailable or the requested transition is not performed | [1 - {sum of downtime (minutes) / sum of total service usage time per month (minutes)}] * 100 | Sum of downtime that occurred in the month due to reasons of the company (SDS) |
| DBaaS, Event Streams, Search Engine | When all running multi-instances or individual services fail to secure external connections and access for more than 5 minutes | [1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100 | Sum of downtime that occurred in the month due to the company’s (SDS) reasons |
| Kubernetes Engine | When external connection and access to the control plane cannot be secured for more than 5 minutes | [1 - {sum of outage time (minutes) / sum of total service usage time for the month (minutes)}] * 100 | Sum of outage time that occurred in the month due to reasons of the company (SDS) |
| Container Registry | When all connection requests to the Container Registry Endpoint fail for 5 minutes | [1 - {sum of downtime (minutes) / total service usage time in the month (minutes)}] * 100 | Total downtime in the month due to reasons of the company (SDS) |
| DDoS Protection, Secured VPN, Secured Firewall, WAF, IPS, SASE | When monitoring and detection of customer services fail due to non-operation or malfunction of security equipment provided by the company | [1 - {sum of downtime (minutes) / total minutes of service usage per month}] * 100 | Total downtime in the month caused by reasons of the company (SDS) |
| File Storage, Block Storage | When a timeout of 15 seconds or more occurs from Compute to Storage | [1 - {sum of downtime (minutes) / total service usage time per month (minutes)}] * 100 | Sum of downtime that occurred in the month due to the company’s (SDS) reason |
| Transit Gateway, Direct Connect | Port is down, or due to network errors in the connection segment, information cannot be transmitted/received for more than 120 seconds | [1 - {Sum of outage time (minutes) / Sum of total service usage time for the month (minutes)}] * 100 | Total outage time that occurred in the month due to reasons of the company (SDS) |
| Edge Server | Edge Manager connection unavailable or customer service failure due to Edge Server malfunction | [1 - {sum of downtime (minutes) / total service usage time for the month (minutes)}] * 100 | Sum of downtime that occurred in the month due to the company’s (SDS) reasons |
| Private Cloud | When all of the operating customer’s instances or individual services fail to secure CMP access | [1 - {sum of downtime (minutes) / sum of total service usage time for the month (minutes)}] * 100 | Sum of downtime that occurred in the month due to reasons of the company (SDS) |
| Cloud Functions | Return 500 or 503 error code for requests (excluding custom errors) | [1 - (sum of outage time and count)/(sum of total service usage month time and count)] * 100 | sum of outage time and count per 5 minutes |
| Object Storage, Archive Storage | - Returns a 500 error for a storage request - Failure rate: ratio of failed requests - Average failure rate: monthly average failure rate | 100% - average failure rate | - |
| API Gateway | Return 500 or 503 error code for requests | [1 - (sum of failure counts)/(sum of total service usage counts per month)] * 100 | Sum of request counts that encountered errors every 5 minutes |
| Quick Query | When the SQL query request fails to secure external connection and access for more than 5 minutes Downtime: | [1 - (sum of incident counts)/(total number of service usage incidents for the month)] * 100 | Total number of incidents that occurred in the month due to company reasons |
| Backup | Customer requested backup failure (excluding when re-execution succeeds) | [1 - (sum of incident counts)/(total service usage count for the month)] * 100 | Sum of incidents that occurred in the month due to company reasons |
*Typical Common services include Virtual Server, Bare Metal Server, Load Balancer, GSLB, VPN, SingleID, DevOps Service, and AIOS.
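The two main calculation styles in the table, time-based and count-based, can be sketched as follows. The function names and sample figures are hypothetical and for illustration only; the authoritative definitions are those in the SLA.

```python
def time_based_availability(downtime_minutes: float, total_usage_minutes: float) -> float:
    """[1 - (sum of downtime / total service usage time for the month)] * 100."""
    if total_usage_minutes <= 0:
        raise ValueError("total_usage_minutes must be positive")
    return (1 - downtime_minutes / total_usage_minutes) * 100


def count_based_availability(failed_count: int, total_count: int) -> float:
    """[1 - (sum of failure counts / total service usage count for the month)] * 100."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return (1 - failed_count / total_count) * 100


if __name__ == "__main__":
    # Hypothetical month: 30 days of usage with 30 minutes of provider-caused downtime.
    print(f"{time_based_availability(30, 30 * 24 * 60):.4f}%")
    # Hypothetical month: 100,000 requests, 50 of which returned a 500 or 503 error.
    print(f"{count_based_availability(50, 100_000):.4f}%")
```

Only downtime or failures attributable to the provider count toward these formulas, so the measured values should be compared with the workload's own SLO separately.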
Samsung Cloud Platform’s Well-Architected design principles address failures and resilience through two sets of principles: availability and reliability.
The availability design principle focuses on building the architecture around automated response measures so that the service is not interrupted even when a component fails or load spikes unexpectedly.
The reliability design principle focuses on minimizing data loss and enabling rapid service recovery in the event of unplanned failures or disasters.
The core of availability design is to implement high availability by deploying services redundantly, so that even if one component fails, the others continue to operate normally and service is not interrupted.
Ensuring scalability is another important goal of availability design: it prevents increased demand from creating bottlenecks in specific components that would delay or interrupt service responses.
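The effect of redundant deployment on availability can be estimated with a simple model. This sketch rests on assumptions that are not stated in the text above: replicas fail independently, any single healthy replica can serve the full load, and failover itself is instantaneous. The function name `redundant_availability` is hypothetical.

```python
def redundant_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant components, as a fraction (0 to 1).

    Assumes independent failures and that one healthy replica is sufficient:
    the service is down only when every replica is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1 - (1 - component_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): {redundant_availability(0.99, n):.6%}")
```

Under these assumptions, two replicas that are each 99% available yield about 99.99%. Real deployments see smaller gains because failures are often correlated and failover takes time.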
