The page has been translated by Gen AI.

Availability Management

Availability Check and Test

Availability Scenario Test

Best practice
Test availability scenarios to verify that the service can handle load and respond to failures in accordance with the specified requirements.

In on-premises environments, a small-scale test environment is set up to conduct testing, whereas in the cloud, a production-scale test environment that mirrors the actual deployment can be configured for testing.

By conducting tests in a separate environment without affecting users, you can verify that the service operates correctly without damaging the service or data.

Design Principles
  1. Develop test scenarios and conduct unit functional tests and integration tests.
  2. Inspect the measurement metrics against the documented Service Level Objective (SLO), Service Level Agreement (SLA), and similar targets, and measure whether the service is operating in line with those objectives.
  3. During a load test, confirm that the required load is processed by adjusting resources.
  4. Identify bottlenecks and perform improvements.
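
As a minimal sketch of the second principle, the check below converts measured downtime into a monthly availability figure and compares it with an SLO. The function names are hypothetical, and the 730-hour month is the same convention used later in this guide.

```python
MINUTES_PER_MONTH = 730 * 60  # 730-hour month (assumed convention)


def monthly_availability(downtime_minutes: float) -> float:
    """Return the achieved availability (0..1) for one 730-hour month."""
    if not 0 <= downtime_minutes <= MINUTES_PER_MONTH:
        raise ValueError("downtime must fit within one month")
    return 1 - downtime_minutes / MINUTES_PER_MONTH


def meets_slo(downtime_minutes: float, slo: float) -> bool:
    """True when the measured availability is at or above the SLO."""
    return monthly_availability(downtime_minutes) >= slo


if __name__ == "__main__":
    # Illustrative downtime values, not measurements.
    for downtime in (30, 60):
        achieved = monthly_availability(downtime)
        print(f"{downtime} min down: {achieved:.4%}, meets 99.9%: {meets_slo(downtime, 0.999)}")
```

Feeding the function with downtime measured during a test run gives a direct pass or fail against the documented objective.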

The following sections review the architecture design and testing for each availability goal.

The following shows the availability status of each major service of Samsung Cloud Platform.

| Service | Maximum monthly availability | Downtime |
|---|---|---|
| Virtual Server (Single) | 99.9% | 43.8 minutes |
| Virtual Server (redundancy) | 99.99% | 4.3 minutes |
| Database (HA) | 99.95% | 21.9 minutes |
| Load Balancer | 99.95% | 21.9 minutes |
Table. Service availability
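
The downtime column follows from the availability percentage and a 730-hour month. The sketch below reproduces the figures; the function name is hypothetical. For 99.99% the calculation gives about 4.38 minutes, which the table lists as 4.3 minutes.

```python
HOURS_PER_MONTH = 730  # month length used for the downtime figures


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Monthly downtime (minutes) permitted by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_MONTH * 60 * (100 - availability_percent) / 100


if __name__ == "__main__":
    for service, pct in [
        ("Virtual Server (Single)", 99.9),
        ("Virtual Server (redundancy)", 99.99),
        ("Database (HA)", 99.95),
        ("Load Balancer", 99.95),
    ]:
        print(f"{service}: {allowed_downtime_minutes(pct):.2f} minutes")
```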

If the availability target is a monthly availability of 99%, the allowable repair time (Mean Time To Repair, MTTR) based on a 730‑hour month is 7.3 hours (438 minutes), because 1% of 730 hours is 7.3 hours.

The three-tier architecture to achieve these availability goals is as follows.

Diagram
Figure. 99% availability target architecture

Web, App, and DB are each deployed as a single instance. Each VM keeps a copy as a backup, and the DB keeps a copy through the backup feature provided by the Database service.

When a failure occurs in a component, the allowable downtime under that component's SLA is about 44 minutes per month (43.8 minutes at 99.9%).

The steps for conducting availability testing can be organized as follows.

| Item | Test item | Estimated time |
|---|---|---|
| Fault detection test | Cloud Monitoring Event Alert Settings | Administrator analysis/response time |
| Failure response test | Restore Virtual Server from Backup; Public IP Switch | Up to 20 minutes |
| Resilience test | Create new Database; Restore Database from Backup; Apply data changes up to the point of failure after backup | Up to 60 minutes |
| Load test | Virtual Server type change; Database type change | Up to 15 minutes per task |
Table. Availability target 99%: example of estimated recovery time

When performing step-by-step recovery after a failure, the estimated recovery time is up to 90 minutes.

If you prepare a CLI script instead of using the Console for recovery tasks and perform data operations in a test environment, the work time can be further reduced.

However, the most uncertain interval in the entire recovery process is the time from a failure occurring, through its detection and notification to the administrator, to the actual start of recovery operations.

How much this time can be reduced is something that needs to be reviewed during the testing process.
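
One way to review this interval during testing is to record a timestamp at each phase of an incident and compare the phase durations between test runs. The sketch below is only an illustration: the `IncidentTimeline` class, its field names, and the sample timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Timestamps recorded by the tester for one failure drill."""
    failed_at: datetime
    detected_at: datetime
    notified_at: datetime
    recovery_started_at: datetime
    recovered_at: datetime

    def phases(self) -> dict[str, timedelta]:
        """Duration of each phase, in the order the phases occur."""
        return {
            "detection": self.detected_at - self.failed_at,
            "notification": self.notified_at - self.detected_at,
            "response_start": self.recovery_started_at - self.notified_at,
            "recovery_work": self.recovered_at - self.recovery_started_at,
        }

    def time_to_start(self) -> timedelta:
        """The uncertain interval: failure until recovery work begins."""
        return self.recovery_started_at - self.failed_at


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    drill = IncidentTimeline(
        failed_at=datetime(2024, 1, 1, 10, 0),
        detected_at=datetime(2024, 1, 1, 10, 2),
        notified_at=datetime(2024, 1, 1, 10, 3),
        recovery_started_at=datetime(2024, 1, 1, 10, 17),
        recovered_at=datetime(2024, 1, 1, 11, 20),
    )
    for name, duration in drill.phases().items():
        print(f"{name}: {duration}")
    print("time to start of recovery:", drill.time_to_start())
```

Comparing `time_to_start()` across drills shows whether alert settings and on-call procedures are actually shortening the interval.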

The following architecture targets a monthly availability of 99.9%, which translates to an MTTR (Mean Time To Recovery) of approximately 44 minutes based on a 730‑hour month.

Diagram
Figure. 99.9% availability target architecture

Web, App, and DB are all configured with redundancy for high availability.

A Load Balancer is implemented in front of the Web and App to distribute requests across redundant servers.

The DB is deployed as a high-availability Database service and configured in an Active-Standby setup.

The maximum availability of each redundant component meets or exceeds the 99.9% service availability target.

The steps for conducting availability testing can be organized as shown in the table below.

| Item | Test item | Estimated time |
|---|---|---|
| Virtual Server: Failure detection and response test | Load Balancer Health Check Failure Detection and Failover | 30 seconds (default settings) |
| Database: Failure detection and response test | Switch to Standby DB after stopping the Active DB | Seconds to minutes |
| Resilience test | Restore Virtual Server from backup; Create new Database; Recover Database from backup; Apply data changes up to the point of failure after backup | Up to 60 minutes |
| Load test | Create a new VM using a Custom Image and register it with the Load Balancer; Change the Database server type | Up to 15 minutes per task |
Table. Availability target 99.9%: example of estimated recovery time

For each step, we shut down one of the redundant servers and perform a test to verify that fail-over occurs.
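
A failover test of this kind can be evaluated by probing the service at a fixed interval while one server is shut down, then measuring how long requests failed. The sketch below assumes the probe results have already been collected as `(timestamp_seconds, succeeded)` pairs; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Optional


def outage_seconds(probes: Iterable[tuple[float, bool]]) -> Optional[float]:
    """Time from the first failed probe to the next successful probe.

    `probes` must be in chronological order. Returns None when no failure
    was observed or when the service never recovered during the test.
    """
    first_failure: Optional[float] = None
    for timestamp, succeeded in probes:
        if not succeeded and first_failure is None:
            first_failure = timestamp
        elif succeeded and first_failure is not None:
            return timestamp - first_failure
    return None


if __name__ == "__main__":
    # Made-up probe log: the server is stopped shortly before t=10.
    sample = [(0, True), (5, True), (10, False), (15, False), (40, True)]
    print("observed outage (s):", outage_seconds(sample))
```

The measured outage can then be compared with the estimated time in the table for the same test item.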

The Load Balancers configured for Web and App each forward incoming requests to their respective VMs while performing health checks.

When configuring a Health Check, you can set the check interval, timeout, and number of attempts.

By default, the period is 5 seconds, the wait time is 5 seconds, and the detection count is 3, so detecting a failure and performing failover takes 30 seconds [(period 5 seconds + wait time 5 seconds) × detection count 3].

Users can modify the value of each item within a range of 1 to 2,147,483,647 seconds. With the minimum settings, the fault detection time can be reduced to as low as 6 seconds.
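
The detection time formula can be sketched as a small helper. The function name is hypothetical, and the range check is applied only to the two settings expressed in seconds. The 6-second figure is reproduced here with a period and wait time of 1 second and the detection count left at 3, which is an assumption about how that figure was derived.

```python
MAX_SECONDS = 2_147_483_647  # upper bound quoted for the configurable values


def detection_seconds(period: int = 5, wait: int = 5, count: int = 3) -> int:
    """Time to declare a server failed: (period + wait) * count."""
    for name, value in (("period", period), ("wait", wait)):
        if not 1 <= value <= MAX_SECONDS:
            raise ValueError(f"{name} must be between 1 and {MAX_SECONDS} seconds")
    if count < 1:
        raise ValueError("count must be at least 1")
    return (period + wait) * count


if __name__ == "__main__":
    print("default settings:", detection_seconds())
    print("period=1, wait=1, count=3:", detection_seconds(1, 1, 3))
```

Shorter settings detect failures sooner but also make the health check more sensitive to brief slowdowns, so the values are worth validating during the failover test.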

Unlike the previous scenario, failover is automated, so a simple failure can be recovered within the availability target.

However, if a VM or database is lost, it still takes time to recover the data.

When conducting a load test, you must create a new VM from the image and register it directly with the Load Balancer's server group.

For databases, respond to load by changing the server type.

The following architecture targets a monthly availability of 99.99%, with an MTTR of approximately 4 minutes based on a 730‑hour month.

Diagram
Figure. Availability target 99.99% architecture

The VPC and each service are implemented on a Multi-AZ basis.

The VMs for Web and App are deployed with metric-based Auto-Scaling, and a Replica is added to the database to offload its read workload.

The availability of each component deployed across Multi-AZ is designed to meet the service’s required monthly availability of 99.99%.

The steps for conducting availability testing can be organized as shown in the table below.

| Item | Test item | Estimated time |
|---|---|---|
| Virtual Server: Failure detection and response test | Load Balancer Health Check Failure Detection and Failover | 30 seconds (default settings) |
| Database: Failure detection and response test | Switch to the Standby DB after stopping the Active DB | Seconds to minutes |
| Resilience test | Restore Virtual Server from backup; Create new Database; Recover Database from backup; Apply data changes up to the point of failure after the backup | Up to 60 minutes |
| Load test | Auto-Scaling new server creation; Change Database server type; Create additional Replica based on database read load | Up to 5 minutes |
Table. Availability target 99.99%: example of estimated recovery time

The difference from the previous scenario is that resources are deployed across multiple AZs, and Auto-Scaling automatically scales the servers out or in as the load changes.

Stop one of the redundant components to conduct a failover test, then incrementally increase the load and compare the server provisioning speed with the rate of load increase.

If the load increases faster than servers can be created, lower the scale‑out policy threshold to trigger expansion earlier, and increase the number of servers created.
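
A rough way to reason about the threshold is to assume that utilization grows linearly during the ramp. Under that simplifying assumption, the scale-out must trigger early enough to leave headroom for the time a new server takes to enter service. The function name and the numbers below are hypothetical.

```python
def latest_safe_threshold(
    growth_pct_per_min: float,
    provisioning_min: float,
    saturation_pct: float = 100.0,
) -> float:
    """Highest utilization (%) at which scale-out can trigger and still have
    the new server in service before saturation, assuming linear growth."""
    if growth_pct_per_min < 0 or provisioning_min < 0:
        raise ValueError("growth rate and provisioning time must be non-negative")
    headroom_needed = growth_pct_per_min * provisioning_min
    return max(0.0, saturation_pct - headroom_needed)


if __name__ == "__main__":
    # Illustrative values: utilization rises 4 points per minute and a
    # new server takes 5 minutes to become available.
    current_threshold = 90.0
    safe = latest_safe_threshold(growth_pct_per_min=4, provisioning_min=5)
    print("latest safe threshold (%):", safe)
    print("lower the threshold:", current_threshold > safe)
```

When the result is 0, lowering the threshold alone cannot keep up with the ramp, and creating more servers at once becomes the remaining adjustment.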

Configure the database read load to be performed on the replica in advance.

During load testing, measure the database latency at each stage, and after the test, adjust the server type to an appropriate capacity.
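
The per-stage latency measurement can be summarized with a percentile so that stages are comparable. The sketch below uses a nearest-rank percentile and reports the first load stage whose latency exceeds a limit. The function names, the latency limit, and the sample data are hypothetical.

```python
import math
from typing import Optional


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct between 1 and 100)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def first_stage_over_limit(
    stages: dict[int, list[float]], limit_ms: float, pct: int = 95
) -> Optional[int]:
    """Lowest load level whose latency percentile exceeds limit_ms, if any."""
    for load in sorted(stages):
        if percentile(stages[load], pct) > limit_ms:
            return load
    return None


if __name__ == "__main__":
    # Made-up latencies (ms) keyed by load stage (requests per second).
    samples = {
        100: [4, 5, 5, 6, 7],
        200: [5, 6, 7, 8, 12],
        400: [9, 15, 22, 40, 65],
    }
    print("first stage over 50 ms:", first_stage_over_limit(samples, limit_ms=50))
```

The stage at which the limit is first exceeded gives a starting point for choosing the server type after the test.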

Conduct periodic reviews

  • Planned Maintenance: Test components and service flows during the regular maintenance period, when components are updated or security patches are applied. Test each component change and verify whether new bottlenecks arise in the overall service flow.

  • Unplanned Failure: If an unexpected service interruption occurs, confirm the incident, assess the recovery time, and restore service according to the predetermined priority. Next, identify the root cause of the interruption and resolve it, collecting logs to support corrective actions. When the root cause analysis is complete, document the root cause, the solution, and the preventive measures to avoid recurrence. If resolving the issue requires a prolonged service outage, schedule the work for the regular maintenance window and implement the prepared emergency measures before the maintenance begins. After resolving the issue, perform unit and integration testing to maintain overall availability.

Monitoring and Alerts

Monitoring and Log Collection

Best practice
Configure metrics for all possible components, apply monitoring, and collect logs.

After selecting and collecting key metrics from all components, link the analysis results to notifications to ensure workload stability and an optimal user experience.

Perform real-time monitoring based on metrics that reflect the availability requirements, collect time-series logs, and integrate the analysis results with alerts to detect failures in advance and prepare for them.

Monitor components at all layers, and when necessary define key metrics and extract those metrics based on log data.

Design Principles
  1. Set metrics and goals to maintain availability.
  2. Enable monitoring for all available services and configure the dashboard.
  3. Enable logging for all possible services, and manage the collected logs in a central repository.

If a metric falls below the availability target, take the necessary actions.

For example, if the average CPU utilization of a specific Virtual Server exceeds 90%, the probability of a failure on that server increases.

If it is a single server, replace it with a server that has increased capacity, and if it is a group of servers configured with a Load Balancer, add new servers to reduce the overall load.
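
The rule above can be written as a small decision helper. This is only a sketch: the function name and the returned action labels are hypothetical, and the 90% threshold is the example value from the text.

```python
def recommend_action(
    avg_cpu_percent: float,
    behind_load_balancer: bool,
    threshold: float = 90.0,
) -> str:
    """Suggest a response to sustained high CPU utilization.

    Returns "none" while utilization is at or below the threshold,
    "scale_out" (add servers) for a load-balanced group, and
    "scale_up" (larger server type) for a single server.
    """
    if avg_cpu_percent <= threshold:
        return "none"
    return "scale_out" if behind_load_balancer else "scale_up"


if __name__ == "__main__":
    print(recommend_action(95, behind_load_balancer=False))
    print(recommend_action(95, behind_load_balancer=True))
    print(recommend_action(60, behind_load_balancer=True))
```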

Diagram
Figure. ServiceWatch concept diagram

Notification and Response Automation

Best practice
When a metric fails to meet the availability target or a failure occurs, send a notification to the administrator and preconfigure automatic response actions.

Service disruptions can occur not only during working hours but also at any time.

If a metric is detected as not meeting the availability target, promptly notify the administrator of the failure condition and start recovery actions.

Design Principles
  1. When the metric reaches the threshold, set an Event to automatically send a notification.
  2. Prepare the necessary actions for each Event in advance, and execute the actions immediately upon receiving the notification.
  3. Configure response actions that can be implemented automatically to ensure the necessary response is carried out.

The risk level of an Event can be divided into three categories.

| Risk level | Explanation |
|---|---|
| Fatal | This is the highest level of risk. Generally, this level is set for very dangerous situations. |
| Warning | This is a medium level of risk. Generally, this level is set when a situation causes a problem in the system that requires resolution. |
| Information | This is the lowest level of risk. It includes simple notification-level information, and is generally set when a situation requires reference or verification. |
Table. Event Risk Level
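
As an illustration of how the three levels might be applied to a metric, the sketch below maps average CPU utilization to a risk level. The enum, the function name, and all threshold values are hypothetical and would need to be chosen per workload.

```python
from enum import Enum
from typing import Optional


class RiskLevel(Enum):
    INFORMATION = 1
    WARNING = 2
    FATAL = 3


# Hypothetical thresholds (%), checked from the most to the least severe.
CPU_THRESHOLDS = [
    (95.0, RiskLevel.FATAL),
    (90.0, RiskLevel.WARNING),
    (80.0, RiskLevel.INFORMATION),
]


def classify_cpu(avg_cpu_percent: float) -> Optional[RiskLevel]:
    """Return the risk level for a CPU reading, or None below all thresholds."""
    for threshold, level in CPU_THRESHOLDS:
        if avg_cpu_percent >= threshold:
            return level
    return None


if __name__ == "__main__":
    for reading in (70, 85, 92, 99):
        level = classify_cpu(reading)
        print(reading, level.name if level else "no event")
```

Keeping the thresholds in one ordered list makes it straightforward to attach a different notification target or automated action to each level.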