The page has been translated by Gen AI.

Outage Planning

The reliability design principle focuses on minimizing data loss and restoring services as quickly as possible when the system operates abnormally, for example during failures or disasters.

Whereas the availability design principle prepares automatic fault handling (failover) in advance through high-availability designs such as redundancy, so that the failure of a single component can be absorbed, the reliability design principle deals with response strategies for failures or disasters that have already occurred.

This reliability design primarily targets unplanned service interruptions and focuses on securing resiliency when some or all components of an information system reach a failure or outage state that is difficult to recover from.

Depending on the type of cause for a service interruption, the corresponding recovery measures inevitably differ.

In this document, we categorize the causes of service interruption as ‘Failure’ and ‘Disaster’, and provide detailed response measures for each.

First, failure is a concept that focuses on controllable factors from the perspective of information technology service management.

This does not include uncontrollable factors such as natural disasters or human-caused disasters.

In other words, it refers to the degradation, errors, and failures of an information system caused by controllable factors that have a direct impact, such as human failures, system failures, and infrastructure failures (including operational failures, equipment failures, etc.).

In contrast, disaster refers to the interruption of information technology services caused by events occurring outside of information technology that are difficult to prevent or control.

Additionally, damage caused by an information system failure that results in an expected recovery time exceeding the acceptable limit and impeding normal business operations is also considered a disaster. (TTA, Information System Disaster Recovery Guidelines)

Category	Disaster	Failure
Location of the cause	External to IT	Internal to IT
Prevention and control	Impossible	Possible
Scale of IT damage	Entire site	Part of the site
Level of the response organization	Enterprise-wide	Information systems management department
Estimated system recovery time	Medium to long term (several days or more)	Short term (a few hours)

Table. Disasters and Failures

Among the various types of failures, some can be restored to normal condition within a relatively short time, and when they occur in low‑priority tasks, immediate recovery may not be required.

However, some failures directly affect core operations such as customer service and, if they persist, can cause both financial losses and serious damage to the organization’s reputation.

For this reason, high-priority failures require a more focused management and response system in addition to the standard failure management procedures.

In failure management, an emergency situation is one in which a failure occurs in a system that has a broad impact on operations and requires rapid recovery, yet recovery within the allowed time is difficult, so the failure may escalate into an uncontrolled disaster.

To respond effectively to such emergencies, the most important step is to establish an emergency response plan in advance.

Figure. Connection between typical failures and emergency situations (TTA, Information System Failure Management Guidelines)

When a failure occurs, the first thing to do is to quickly assess its severity.

The severity of a failure is expressed as a failure grade, which is determined based on the impact of the failure on core operations and the urgency of recovery.

At this stage, you should pre-estimate the recoverability and expected recovery time for each failure grade, and use this information to decide whether to declare an emergency.

Failure grades must be derived from objective criteria so that the situation can be shared clearly with stakeholders and handled appropriately.

If it is determined that recovery cannot be completed within the allowed time, declare a ‘disaster’ and follow the procedures outlined in the pre‑established disaster recovery plan.
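The decision flow described above can be sketched as follows. This is a minimal illustration rather than a method defined by the guidelines: the 1 to 3 scoring scale, the grade thresholds, and the names `Failure`, `failure_grade`, and `decide_response` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    impact: int                    # impact on core operations, 1 (low) to 3 (high); hypothetical scale
    urgency: int                   # urgency of recovery, 1 (low) to 3 (high); hypothetical scale
    est_recovery_hours: float      # pre-estimated recovery time for this failure
    allowed_recovery_hours: float  # allowable recovery time set by the organization


def failure_grade(f: Failure) -> int:
    """Return a grade from 1 (most severe) to 3 (least severe).

    The thresholds are illustrative assumptions, not values from the guidelines.
    """
    score = f.impact * f.urgency  # ranges from 1 to 9
    if score >= 6:
        return 1
    if score >= 3:
        return 2
    return 3


def decide_response(f: Failure) -> str:
    """Map a failure to a response level."""
    if f.est_recovery_hours > f.allowed_recovery_hours:
        return "disaster"   # recovery cannot finish in time: follow the DR plan
    if failure_grade(f) == 1:
        return "emergency"  # broad impact and rapid recovery required
    return "standard"       # handle through the normal failure procedure
```

An organization would replace the scoring rule with its own objective criteria; the point of the sketch is only that the grade and the recovery-time comparison are evaluated separately.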

At this point, the allowable recovery time can vary depending on the organization’s characteristics, and in certain industry sectors, the higher supervisory authority may also set standards.

For example, the Financial Supervisory Service recommends the following total recovery times (recovery time objectives), including disaster recovery, for each type of financial institution.

Major financial institutions are advised to achieve full recovery within three hours of a disaster.

Institution	Recovery time	Remarks
Banking and securities	Within 3 hours
Financial shared network operating institution, public certification center	Within 3 hours	Korea Financial Telecommunications & Clearings Institute, Securities IT
Securities-affiliated institution, integrated system operating agency	Within 3 hours	Stock Exchange, Futures Exchange, KOSDAQ Market Securities, Securities Depository, Treasury Association
Insurance	Within 24 hours	Including foreign insurance companies
Foreign financial institution	At the institution’s discretion	Disclose recovery time, submit emergency response plan
Other financial institutions	At the institution’s discretion	Submit emergency response plan

Table. Recovery time by financial institution (TTA, Information System Incident Management Guidelines)
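As a rough illustration, the recommended limits in the table above can be held in a lookup and compared with an estimated recovery time. The dictionary keys and the function name `meets_recommendation` are hypothetical labels chosen for this sketch; only the hour values come from the table.

```python
from typing import Optional

# Recommended recovery time in hours, transcribed from the table above.
# None means the institution sets its own target.
RECOMMENDED_RECOVERY_HOURS: dict[str, Optional[int]] = {
    "banking_and_securities": 3,
    "financial_shared_network_operator": 3,
    "securities_affiliated_institution": 3,
    "insurance": 24,
    "foreign_financial_institution": None,
    "other_financial_institution": None,
}


def meets_recommendation(category: str, estimated_hours: float) -> Optional[bool]:
    """Compare an estimated recovery time with the recommended limit.

    Returns None when the category has no fixed limit.
    """
    limit = RECOMMENDED_RECOVERY_HOURS[category]
    if limit is None:
        return None
    return estimated_hours <= limit
```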

Backup Policy Configuration and Automation

Best practice

Establish backup policies based on business criticality and automate backup execution.

Backup, the most common data protection measure, means copying data to a separate, secure storage device to guard against damage to or loss of the original data caused by server failures, power outages, earthquakes and other disasters, external attacks, or tampering.

Backup is a critical component of an organization’s data protection and recovery strategy, performed regularly to minimize data loss.

Backup intervals are designed based on the allowable data loss period according to business importance.

Backups should be configured to be created automatically based on a regular schedule or changes to the data set, enabling the organization to minimize data loss and optimize the recovery point.

Important data sets have a small tolerance for loss, so they should be automatically backed up frequently.

Conversely, data of low importance that can tolerate some loss can be backed up at a lower frequency.
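The relationship between the allowable data loss period and the backup interval can be sketched as below. If a backup runs every `interval`, a failure just before the next run loses at most `interval` worth of changes (ignoring the time the backup itself takes), so the interval must not exceed the RPO. The candidate intervals and the name `pick_backup_interval` are assumptions for illustration.

```python
from datetime import timedelta

# Candidate schedules, longest first; this set of intervals is a hypothetical example.
CANDIDATE_INTERVALS = [
    timedelta(weeks=1),
    timedelta(days=1),
    timedelta(hours=1),
    timedelta(minutes=5),
]


def pick_backup_interval(rpo: timedelta) -> timedelta:
    """Return the longest candidate interval that does not exceed the RPO."""
    for interval in CANDIDATE_INTERVALS:
        if interval <= rpo:
            return interval
    raise ValueError("RPO is shorter than the most frequent schedule available")
```

Choosing the longest interval that still satisfies the RPO keeps backup load low for less important data while frequently backing up data with a small tolerance for loss.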

When designing a backup policy, you must always consider the backup window.

Backup window refers to the time during which a backup can actually be performed, and when determining it, you should consider the following two factors.

Minimize business impact during backup time: For a daily backup policy, backups generally run between the end of the workday and the start of work the next day, so that the server load generated by the backup does not affect business operations. For a weekly backup policy, schedule backups on the weekend.
Validity of backed-up data during recovery: The point in time at which data was backed up can affect how useful the backup is during recovery. For example, whether a backup is taken before or after a batch job determines whether additional steps are required during recovery, which in turn affects the time required for full recovery. Set the backup schedule with the nature of the work in mind.
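One way to check the first factor is to test whether an expected backup duration fits inside the window. The sketch below assumes a window given as two wall-clock times and treats a window whose end is not after its start as crossing midnight; the function name and the example times are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def fits_backup_window(start: time, end: time, duration: timedelta) -> bool:
    """Check whether a backup of `duration` fits in a window from `start` to `end`.

    A window whose end is not after its start is treated as crossing midnight.
    """
    day = date(2000, 1, 1)  # arbitrary anchor date, used only for the arithmetic
    window_start = datetime.combine(day, start)
    window_end = datetime.combine(day, end)
    if window_end <= window_start:
        window_end += timedelta(days=1)
    return duration <= window_end - window_start
```

For example, a window from 19:00 to 07:00 is 12 hours long, so `fits_backup_window(time(19, 0), time(7, 0), timedelta(hours=10))` returns `True`.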

Design Principles

Back up the Virtual Server using the Backup service.
Back up the database using the built-in backup feature of the Database service.
Enable snapshots and versioning to protect Storage data.

The Backup services and features provided by Samsung Cloud Platform are shown in the table below.

When designing the Recovery Point Objective (RPO), you can either choose a storage service based on the target RPO or review the RPO in light of the storage service’s backup schedule.

Backup target service	Backup function	Backup method	Backup schedule	Backup retention period	Backup repository
Virtual Server	Backup service	Full or incremental backup of VM snapshots	Daily/weekly/monthly	2 weeks to 1 year	Samsung Cloud Platform-managed repository
Bare Metal Server	Backup service	File system agent backup	Daily/weekly/monthly	2 weeks to 1 year	Samsung Cloud Platform-managed repository
DBaaS	Built-in function	DB-based backup	Data: 1 day; Archive: 5 minutes to 1 hour	7 to 35 days	User-managed Object Storage
File Storage	Built-in function	Snapshot	Daily/weekly	Automatic: 128 items; Manual: 800 items	Inside File Storage
Object Storage	Built-in function	Versioning	Immediately upon change	No restrictions	Inside the bucket

Table. Backup methods per Samsung Cloud Platform service
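To illustrate reviewing an RPO against these schedules, the sketch below records the shortest schedule interval listed for each service and returns the services whose shortest interval fits a target RPO. Treating the shortest listed interval as the best achievable RPO is a simplifying assumption, and the function name is hypothetical.

```python
from datetime import timedelta

# Shortest schedule interval per service, read from the table above.
SHORTEST_BACKUP_INTERVAL = {
    "Virtual Server": timedelta(days=1),
    "Bare Metal Server": timedelta(days=1),
    "DBaaS": timedelta(minutes=5),    # archive backup
    "File Storage": timedelta(days=1),
    "Object Storage": timedelta(0),   # a new version is kept on every change
}


def services_meeting_rpo(rpo: timedelta) -> list[str]:
    """Return, in alphabetical order, the services whose shortest interval is within the RPO."""
    return sorted(
        name
        for name, interval in SHORTEST_BACKUP_INTERVAL.items()
        if interval <= rpo
    )
```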

Server Backup Architecture

The architecture below shows server backup and backup DR as implemented on Samsung Cloud Platform.

Figure. Server backup and DR architecture

To back up a Virtual Server, create a Backup service; the Backup service snapshots the Virtual Server, stores the image as a backup, and can also distribute the backup to a remote location. Recovery is performed by creating a new Virtual Server from a backup copy at a specific point in time.
When you enable the DR option during backup creation, a backup copy is replicated to the DR site when the backup is performed on the primary site. The DR option cannot be enabled on an already created Backup service; to configure DR, you must create a new Backup service.
Bare Metal Server can be configured for agent-based backup.

Database Backup Architecture

The Database service provides database backup functionality by default.

Database backup performs both data backup and archive backup.

Backups must be stored in an Object Storage bucket that the user creates and designates as the storage location.
A backup is not restored directly onto the existing server. Instead, create a new database from the backup, then convert the new database to the Master database.

Storage backup

Samsung Cloud Platform File Storage protects data using two methods: snapshots and disk backups.

Both methods use the File Storage repository as the source for backup copies and can generate backup copies through scheduling.

As shown in the figure, you can check the snapshot (/.snapshot) of the File Storage mounted on the server.

By checking the snapshot path, you can view the directories and files as they existed at the time the snapshot was taken, and you can locate and restore any directories or files that require recovery.

Object Storage protects data through versioning, which stores changed copies of objects, rather than through a separate backup mechanism.

When versioning is enabled, all previous versions of an object are saved each time the object changes, allowing you to review the change history when needed.
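The versioning behavior can be pictured with a small in-memory model. This is a toy sketch of the concept only; the class `VersionedBucket` and its methods are hypothetical and do not represent the Object Storage API.

```python
from typing import Optional


class VersionedBucket:
    """Toy in-memory model of object versioning; not the Object Storage API."""

    def __init__(self) -> None:
        self._versions: dict[str, list[bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Store a new version of `key` and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(data)
        return len(history)

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the latest version of `key`, or a specific earlier version."""
        history = self._versions.get(key)
        if not history:
            raise KeyError(key)
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{key} has no version {version}")
        return history[version - 1]

    def version_count(self, key: str) -> int:
        """Return how many versions of `key` are stored."""
        return len(self._versions.get(key, []))
```

Because every `put` appends rather than overwrites, an unwanted change can be undone by reading an earlier version, which is the protection versioning offers in place of a scheduled backup.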

Backup protection and encryption

Best practice

Safely manage backup copies to ensure data protection and integrity.

The primary purpose of backup is to protect an organization’s critical data from loss.

Critical data is determined by the organization based on business impact. It is primarily associated with tasks directly linked to the organization’s core services, and sometimes includes data that the law requires to be retained.

The following table shows examples of the records that must be retained by corporations, public institutions, and medical institutions, along with their corresponding retention periods.

Organization	Data	Retention period	Legal basis
Company	Commercial ledgers and key business documents	10 years	Commercial Code Article 33
Company	Voucher	5 years	Commercial Code Article 33
Company	Employee roster and employment contract documents	3 years	Labor Standards Act Articles 42 and 91
Company	Corporate transaction ledgers and transaction documents	5 years	National Tax Basic Act Article 85-3
Company	Dealer agreement	3 years after the transaction ends	Act on the Fairness of Agency Transactions, Article 5
Company	Documents related to subcontracting transactions	3 years after the transaction ends	Act on the Fairness of Subcontracting Transactions, Article 3
Company	Industrial safety documents	3 years	Industrial Safety and Health Act
Company	Personal information	Destroy immediately when no longer needed	Personal Information Protection Act Article 21
Public institution	All types of recorded information materials (including documents, books, registers, cards, drawings, audiovisual materials, and electronic documents) produced or received by public institutions in the course of their duties, as well as administrative artifacts	Permanent/semi-permanent/30 years/10 years/5 years/3 years/1 year	Act on the Management of Public Records and Its Enforcement Decree
Medical institution	Medical records, surgical records	10 years	Medical Service Act Enforcement Rules Article 15
Medical institution	Patient register, radiographic images, test records, nursing records, etc.	5 years	Medical Service Act Enforcement Rules Article 15
Medical institution	Prescription	2 years	Medical Service Act Enforcement Rules Article 15

Table. Examples of required retained documents and retention periods
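A backup retention setting can be checked against such periods with a small calculation. The sketch below uses a few of the periods from the table and assumes, purely for illustration, that the retention clock starts on the record's creation date; the actual starting point depends on the applicable law. The keys and function names are hypothetical.

```python
from datetime import date

# Retention periods in years, taken from a few rows of the table above.
RETENTION_YEARS = {
    "commercial_ledger": 10,
    "voucher": 5,
    "medical_record": 10,
    "prescription": 2,
}


def retention_end(created: date, record_type: str) -> date:
    """Return the date on which the retention period ends.

    Assumes the period starts on the creation date (an illustrative assumption).
    """
    target_year = created.year + RETENTION_YEARS[record_type]
    try:
        return created.replace(year=target_year)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return created.replace(year=target_year, day=28)


def may_delete(created: date, record_type: str, today: date) -> bool:
    """Return True once the retention period has fully elapsed."""
    return today >= retention_end(created, record_type)
```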

In on-premises environments, backup copies are stored (distributed) in a secure remote location to prepare for site disasters.

In the cloud, manage backup copies securely by configuring access controls for them, maintaining integrity through encryption, and preparing for site disasters through backup DR (disaster recovery).

Design Principles

Use Archive Storage, Multi-AZ, and DR configurations to prevent loss of backup copies due to a failure or disaster at a single point or site.
Ensure the integrity of backup data through access control and encryption of the backup repository.

※ Multi-AZ and Region-to-Region replication for Object Storage are planned for a future release (2026).

Archive the Object Storage bucket that holds the backups to Archive Storage for long-term retention.
Deploy Object Storage across multiple AZs to prepare for failures in a single availability zone.
Implement DR replication to prevent data loss in the event of a regional disaster.
To prevent unauthorized access to backup copies, set access controls by specifying the access server, IP address, and endpoint.
Enable bucket encryption to ensure data integrity.
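The IP-based part of such an access control rule amounts to an allowlist check. The sketch below shows only the evaluation logic, using Python's standard `ipaddress` module; the networks listed are hypothetical examples, and the actual rule is configured in the platform rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: a backup server subnet and one administration host.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.10.0/24"),
    ipaddress.ip_network("192.168.5.7/32"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True if `source_ip` falls inside any allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in network for network in ALLOWED_NETWORKS)
```

Keeping the allowlist as narrow as the backup workflow permits limits the number of hosts that can read or modify backup copies.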

Develop a recovery plan for failures

To prepare for server and data loss due to failures, you can establish the following recovery plan.

Step	Main activities
Step 1: Situation assessment	Problem identification: determine the cause of system downtime or data loss; Impact assessment: evaluate the scope of affected systems and data
Step 2: Prepare to execute recovery plan	Assemble recovery team: secure specialized personnel to perform recovery tasks; Verify recovery data: check the backup policy, backup tools, and recent snapshot status of the target service; Prepare recovery environment: determine the network environment for recovery (existing network or new network)
Step 3: Perform recovery	Cloud infrastructure configuration for recovery: VPC, Server Net, Security Group, storage; Perform recovery: execute recovery using the selected backup copy
Step 4: Testing and validation	Test the recovered data and system operation: data integrity, application functionality, network connectivity, etc.; Have actual users verify that normal business can be carried out
Step 5: Normalization report	Normalize the system after confirming the proper operation of all systems and data; Prepare a report on the recovery procedure and results; Record how issues that occurred during recovery were resolved, along with future improvement measures

Table. Failure recovery procedure
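The five steps are sequential, and each depends on the previous one succeeding. A minimal sketch of that ordering is shown below; the step names follow the table, while the `run_recovery` function and the idea of representing each step as a callable that reports success are assumptions for illustration.

```python
from typing import Callable

# Step names in the order given in the table above.
STEPS = [
    "Situation assessment",
    "Prepare to execute recovery plan",
    "Perform recovery",
    "Testing and validation",
    "Normalization report",
]


def run_recovery(actions: dict[str, Callable[[], bool]]) -> list[str]:
    """Run the steps in order and stop at the first one that reports failure.

    `actions` maps each step name to a callable returning True on success.
    Returns the names of the steps that completed.
    """
    completed: list[str] = []
    for step in STEPS:
        if not actions[step]():
            break
        completed.append(step)
    return completed
```

The returned list makes it easy to record in the final report how far the procedure progressed and where it stopped.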

Data backup recovery test

Best practice

Regularly test recovery to verify that the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are met.

Data backup and recovery testing is an important process to verify that recovery is performed correctly.

Even if backups are performed in accordance with information security regulations, without regular checks, it may be difficult to recover data as planned in the event of a failure.

Therefore, you should regularly perform recovery tests on backups to verify that the restored system operates correctly.

Design Principles

Check the original backup data and its replica to confirm that the automated backup was executed correctly, and verify data validity.
Set up an environment for recovery testing and conduct recovery drills.
If data recovery fails or does not meet the target RTO and RPO, perform backup verification tasks and improvements.
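The checks in a recovery drill can be sketched as follows: compare a checksum of the backup copy with that of the restored copy, then compare the measured recovery time and data loss window with the targets. The class `RecoveryDrill`, the function `evaluate`, and the choice of SHA-256 are assumptions for this example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of `data`."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class RecoveryDrill:
    failure_time: datetime      # simulated moment of failure
    last_backup_time: datetime  # timestamp of the backup that was restored
    restored_time: datetime     # moment the service was confirmed working again


def evaluate(
    drill: RecoveryDrill,
    rto: timedelta,
    rpo: timedelta,
    backup_copy: bytes,
    restored_copy: bytes,
) -> dict[str, bool]:
    """Report whether the restored data matches and whether RTO and RPO were met."""
    return {
        "data_valid": sha256_of(backup_copy) == sha256_of(restored_copy),
        "rto_met": drill.restored_time - drill.failure_time <= rto,
        "rpo_met": drill.failure_time - drill.last_backup_time <= rpo,
    }
```

Any `False` value in the result indicates that backup verification and improvement work is needed before the next drill.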