
Operations Design

Resource Management and Optimization

Cloud Resource Management

Cloud resource management optimization is redefined as a continuous activity that goes beyond simple technical capacity management to maximize cost efficiency relative to business outcomes.

The pay-as-you-go cloud model can lead to unnecessary cost wastage if neglected, requiring a strategic approach that combines technical capabilities with financial operations (FinOps).

It is important to consider elastic design from the initial design phase to enable applications and infrastructure to respond flexibly to changes in demand, minimize unnecessary resource allocation, and improve cost efficiency.

Elastic design leverages the scalability of the cloud environment to provide the flexibility to quickly increase or decrease resources as needed.

Continuous optimization is a key element of cloud operations, identifying resource usage patterns and improving inefficiencies through continuous monitoring and analysis.

To achieve this, we utilize automated tools and processes to monitor resource usage in real-time, reduce unnecessary costs, and optimize resource allocation through data-driven decision-making.

FinOps emphasizes the financial aspect of cloud resource management and achieves business goals through cost management and budget control.

Manage cloud costs transparently and maximize value relative to cost so that resources are used efficiently, operations stay within budget, and unnecessary spending is minimized.

This helps maintain organizational financial health and strengthens cost management and financial accountability in the cloud environment. (Note: FinOps is covered in more detail in Cost Optimization Principles.)

This approach contributes to efficiently managing cloud resources, maximizing cost efficiency, and improving business outcomes.

Cloud Resource Management

Most cloud service providers, including Samsung Cloud Platform, set capacity limits on resources.

These restrictions appear as usage limits for specific resources, such as storage capacity and the number of creatable services, and are measures to prevent specific customers or projects from monopolizing resources.

When designing a cloud architecture, you must check capacity limits considering the following factors.

  • Identify the project’s capacity requirements in advance to prevent unexpected resource consumption limits.

  • Establish a resource auto-scaling plan considering the possibility of sudden load spikes.

  • Review the capacity limit regularly to keep it up to date.

You can check service limits in the Samsung Cloud Platform Console (Console > Management > Quota Service).

Some services can request an increase in provided capacity. Review capacity limits prior to resource deployment and submit a request in advance if capacity adjustment is possible.

Average Usage Trend Analysis and Capacity Adjustment

Unlike on-premises environments, cloud providers directly manage physical equipment and facilities in cloud environments.

This enables users to reduce the time spent on equipment procurement and installation, allowing them to focus on service deployment and utilization.

By leveraging the cloud, you do not need to prepare or maintain unnecessary idle resources in advance, and you can respond flexibly by creating or scaling virtual machines (VMs) when needed.

Because billing is pay-as-you-go, you can optimize costs by provisioning extra capacity only when peak traffic requires it.

However, since the required capacity can be created and used immediately in such environments, it is crucial to manage resource usage efficiently.

To do this, you must include a procedure to analyze average usage trends.

To begin this analysis, you must first identify key cloud workloads.

To evaluate the average and maximum utilization of the workload, as well as current and future capacity requirements, you must fully understand the workflow and planning processes of the department using the workload.

It is also necessary to analyze past load patterns.

For example, you can derive patterns using data such as peak usage over the last 3 months, peak usage by time of day, and peak usage per minute.

Using Cloud Monitoring tools, you can analyze the following load patterns.

  • Average and maximum utilization

  • Sudden changes in usage patterns

  • Surges during specific periods caused by changes in business circumstances
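
As a hedged illustration, the load patterns above could be summarized from exported monitoring samples; the `(hour_of_day, utilization %)` input shape is an assumption for the sketch, not a Cloud Monitoring export format:

```python
from statistics import mean

def summarize_load(samples):
    """Summarize utilization samples given as (hour_of_day, percent) pairs."""
    values = [v for _, v in samples]
    by_hour = {}
    for hour, v in samples:
        by_hour.setdefault(hour, []).append(v)
    return {
        "average": mean(values),
        "maximum": max(values),
        # hour of day whose single highest sample is the overall peak
        "peak_hour": max(by_hour, key=lambda h: max(by_hour[h])),
    }

samples = [(9, 40), (9, 85), (14, 55), (14, 60), (3, 10)]
summary = summarize_load(samples)  # average 50, peak of 85 at hour 9
```

In practice the same aggregation would run over weeks of data (e.g. the last 3 months of peaks) rather than a handful of points.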

If a sudden load increase event is planned in advance, it is advisable to review over-provisioning beforehand to respond smoothly to the traffic increase.

Additionally, a procedure is required to verify that a setting is in place to automatically send notifications to the administrator when resource usage approaches the limit.

Cloud Resource Optimization

Cloud resource optimization is a continuous practice and process for operational excellence, and a strategy to improve cost efficiency, maintain service performance, and maximize the benefits of the cloud environment.

To achieve this, you must approach it from various aspects, including computing, storage, idle resource management, and cost model optimization.

Compute optimization involves selecting the optimal instance type for your workload by leveraging statistically significant metrics, rather than simply relying on average CPU utilization.

For example, reduce unnecessary costs and utilize resources efficiently by using metrics such as P95 CPU usage below 30% over the last 14 days.

In this context, P95 CPU usage is the 95th percentile of the collected CPU utilization samples: the value that 95% of measurements fall at or below. Unlike a simple average, it is not distorted by brief spikes, making it a more reliable basis for right-sizing decisions.

If the P95 value stays well below the instance's capacity (for example, under 30% for 14 days), the instance is over-provisioned and is a candidate for downsizing; conversely, a P95 approaching 100% can indicate a bottleneck or overload that calls for scaling up.
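
As an illustrative sketch, the right-sizing rule above (P95 CPU below 30% over 14 days) could be checked like this; the nearest-rank percentile is one common definition, and real monitoring tools may compute percentiles differently:

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    samples at or below it (one common definition; tools may differ)."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

def is_downsize_candidate(cpu_samples, threshold=30.0):
    """True if P95 CPU usage stays under the threshold across the samples
    (e.g. 14 days of measurements)."""
    return percentile(cpu_samples, 95) < threshold

# Mostly idle instance: brief spikes occupy only the top 5% of samples,
# so P95 stays low even though the maximum reaches 90%.
samples = [12.0] * 95 + [28.0, 29.0, 45.0, 60.0, 90.0]
```

Here `is_downsize_candidate(samples)` returns `True` because the spikes fall entirely above the 95th percentile, which is exactly why P95 is preferred over the maximum for sizing decisions.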

Idle resource management is the process of identifying and removing unused resources that only incur costs.

Establish a policy to regularly inspect and delete Unattached Disks/Volumes, Old Snapshots/Images, etc., to reduce costs.
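
A periodic inspection job for such a policy might look like the following sketch; the inventory dicts are a hypothetical stand-in for whatever the cloud inventory API actually returns, and the 90-day retention window is an example value:

```python
from datetime import datetime, timedelta

def find_idle_resources(volumes, snapshots, now, snapshot_max_age_days=90):
    """Flag unattached volumes and snapshots older than the retention window.

    `volumes` and `snapshots` are plain dicts standing in for a real
    inventory API response (hypothetical shape).
    """
    cutoff = now - timedelta(days=snapshot_max_age_days)
    idle_volumes = [v["id"] for v in volumes if v["attached_to"] is None]
    old_snapshots = [s["id"] for s in snapshots if s["created_at"] < cutoff]
    return idle_volumes, old_snapshots

now = datetime(2025, 6, 1)
volumes = [
    {"id": "vol-1", "attached_to": "vm-a"},
    {"id": "vol-2", "attached_to": None},  # unattached -> deletion candidate
]
snapshots = [
    {"id": "snap-1", "created_at": datetime(2025, 5, 20)},
    {"id": "snap-2", "created_at": datetime(2024, 12, 1)},  # past retention
]
idle_volumes, old_snapshots = find_idle_resources(volumes, snapshots, now)
```

A real policy would report the candidates for review (or tag them for grace-period deletion) rather than deleting immediately.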

Finally, cost model optimization is a strategy to discount the bill itself through financial optimization (FinOps) after technical optimization.

Reduce costs by combining various purchasing options, such as pay-as-you-go, commitment plans, and large resource discounts.

By continuously monitoring and improving these strategies, you can increase cost efficiency through cloud resource optimization, maintain service performance, and maximize the benefits of the cloud environment.

Automation Process Design

IT operations is characterized by the need to actively adapt to rapid technological changes in hardware and software provided by various vendors.

Many organizations are building hybrid cloud or multi-cloud systems, and cases of operating on-premises and cloud environments simultaneously are also gradually increasing.

In recently developed systems, technologies such as Microservices run together, and an architecture where millions of devices connect to these services via the internet is becoming widespread.

In such a complex environment, IT operators must perform multiple tasks simultaneously, so it is realistically difficult to handle all tasks manually.

Since the operations team is responsible for continuously operating the service and recovering quickly when an event occurs, it is essential to have a prepared response system in place.

An approach that predicts the likelihood of failures and prepares proactively is more effective than responding after a failure occurs.

For stable and rapid operations, it is recommended to automate processes, and operational areas where automation can be implemented can be found in the following table.

| Area | Explanation |
| --- | --- |
| Pipeline definition, execution, and management | Automate CI/CD pipeline definition and execution using DevOps Service. |
| Deployment | Automate infrastructure deployment and updates using IaC (Infrastructure as Code) tools such as Terraform. |
| Testing | In the DevOps Service pipeline, automate testing using SonarQube to reduce workload. |
| Resizing | Use Auto-Scaling or Kubernetes Engine Node Pool autoscaling to automatically adjust infrastructure size as load increases or decreases. |
| Monitoring and alerts | Configure Cloud Monitoring to register threshold-based events so that an alert is triggered when a metric exceeds the threshold range. |

Table. Operational areas for automation

To maximize cloud-based operational efficiency, it is necessary to implement container-based automation at the infrastructure level and configure a pipeline covering the coding, build, integration, release, and deployment stages at the application level.

A representative process that facilitates such automation implementation is DevOps, which is a methodology that enables the continuous delivery of products or services through collaboration between development and operations teams.

DevOps aims for a structure where development and operations teams collaborate and share responsibilities, exchanging continuous feedback throughout the entire development lifecycle from building to deploying applications.

In such DevOps environments, the following tools and automation elements are utilized.

  • Infrastructure as Code

  • Centralized code repository

  • CI/CD(Continuous Integration and Continuous Deployment) pipeline

First, let’s examine Infrastructure as Code.

Code-based infrastructure provisioning and management

The operations team can significantly improve operational efficiency by leveraging automation technology.

In particular, by leveraging the Infrastructure as Code (IaC) approach, you can automate tasks such as launching new servers or starting and stopping services, reduce repetitive infrastructure management tasks, and invest more time in strategic planning.

When deploying and managing IaC, declarative tools are considered more suitable for production environments than imperative tools.

Declarative tools specify the deployment completion state in a definition file and operate to maintain that state.

Even if a failure occurs during deployment, the system repeatedly performs tasks to achieve the declared state, and if changes arise due to an outage, it automatically restores the original state.

Thanks to these characteristics, operator intervention is reduced, enabling stable infrastructure operations.

In contrast, since imperative tools operate by instructing deployment tasks step-by-step, operators often need to intervene directly in the event of failures or changes.
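
The declarative behavior described above is essentially a reconciliation loop: compare the declared state with the actual state and derive the actions needed to close the gap. A minimal sketch (resource specs modeled as plain dicts, purely illustrative):

```python
def reconcile(desired, actual):
    """One reconciliation pass: compute the actions needed to reach the
    declared state. A declarative tool repeats this until no actions remain,
    so a step that fails mid-deployment is simply retried on the next pass."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {"web-vm": {"size": "m2"}, "db-vm": {"size": "l4"}}
actual = {"web-vm": {"size": "m1"}, "old-vm": {"size": "s1"}}
actions = reconcile(desired, actual)
```

Because the loop is driven only by the declared state, drift introduced by an outage is corrected automatically on the next pass, which is the property that reduces operator intervention.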

Samsung Cloud Platform supports Infrastructure as Code operations and provides CLI and Open API.

By leveraging this, you can automate resource deployment using Terraform, an IaC tool, as shown in the figure below.

![Configuration Diagram](../img/img_operation_04.png "Figure. Infrastructure Deployment Automation using Terraform")

After generating a Terraform authentication key and completing the environment configuration on Samsung Cloud Platform, write the resources and environment configuration to be deployed as code, and deploy VMs to the kr-west1 and kr-east1 regions using Terraform commands.
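
A multi-region rollout like this is often driven by a small wrapper script. The sketch below only builds the per-region Terraform command sequences; the `region` variable name and tfvars file are assumptions for illustration, not part of the Samsung Cloud Platform provider:

```python
def terraform_commands(regions, var_file="scp.tfvars"):
    """Build the per-region Terraform command sequence (sketch only; the
    region variable and tfvars file name are illustrative assumptions)."""
    plans = []
    for region in regions:
        plans.append([
            ["terraform", "init", "-input=false"],
            ["terraform", "apply", "-auto-approve",
             f"-var-file={var_file}", f"-var=region={region}"],
        ])
    return plans

plans = terraform_commands(["kr-west1", "kr-east1"])
```

Each inner list could then be executed with `subprocess.run`, one region at a time, so a failure in the first region halts the rollout before the second.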

Central Code Repository Implementation

Samsung Cloud Platform’s DevOps Service includes built-in tools for storing source code and managing configurations.

You can configure a Private Git Repository for each project to set user-specific access permissions, enabling secure source code sharing, development collaboration, and CI/CD environment configuration.

The following figure illustrates the Application deployment automation architecture using DevOps Service.

![Configuration Diagram](../img/img_operation_05.png "Figure. Automating Application Deployment using DevOps Service")

When a user pushes source code to Git, the Jenkins pipeline triggers, automatically deploying the built Application or container object configuration declaration file to Kubernetes Engine.

By storing not only source code but also infrastructure configuration declaration files in a central repository and managing operations, you can manage changes in an integrated manner and restore to a previous state when necessary using version control features.

Using Continuous Integration and Deployment (CI/CD)

Through Continuous Integration (CI), developers frequently commit to the code repository and automatically perform builds and tests.

In this process, unit tests and integration tests run automatically, and both automated and manual tests are conducted before deploying to the production environment.

Deployment to Staging and production environments is allowed only if all tests pass.

Continuous Integration (CI) refers to the process of automating the build and unit test phases in the software development lifecycle, and every update committed to the code repository generates an automated build and test.

Continuous Deployment (CD) extends the continuous integration process to include the procedure of automatically deploying builds that have passed testing to the production environment.

The following figure shows the CI/CD process of Samsung Cloud Platform.

Conceptual Diagram
Figure. CI/CD based on Samsung Cloud Platform DevOps Service

In actual CI/CD environments, multiple people write code collaboratively, and all developers must work based on the latest build.

To do this, configure a central code repository to manage multiple versions of the code and enable developers to access it and collaborate.

Developers check out code from this repository, make changes or write new code in a local copy, and after testing, commit back to the repository.

Continuous Integration (CI) automates most of the software release process, but the final deployment to the production environment is typically triggered directly by developers.

Continuous Deployment (CD) is a concept that extends Continuous Integration (CI) by automatically deploying all code changes to a testing or production environment after the build phase.

The figure below shows the CI/CD architecture of the development, test, and operation environments implemented on Samsung Cloud Platform.

Configuration Diagram
Figure. DevOps Service-based Development, Testing, and Operations CI/CD
  1. After applying for DevOps Service, configure source code repositories and CI/CD tools in the user area and integrate with the DevOps Console.

  2. The developer pushes the development source code to the developer branch of the source code repository.

  3. To deploy source code to the development environment, submit a Pull Request from the developer branch to the Dev branch.

  4. Merge into the Dev branch after the review phase.

  5. After merging the Dev branch, the CI/CD build pipeline runs, proceeding through the source code checkout → build → container image build process.

  6. The built container image is stored in Container Registry.

  7. The new version image is deployed to the container in the target cluster (dev).

  8. The developer accesses the frontend container to verify the deployment results.

  9. After verifying Dev → Stage, merge into the Production branch for the production release, and repeat steps 4 through 8 for the Production cluster.

  10. Perform a release using the blue-green deployment method and provide services to the End User.

Cloud Deployment

Cloud deployment is a critical process for applying system changes and is one of the primary causes of failures.

Operational excellence depends largely on how you design and execute these faultless deployments.

To achieve this, systematic strategies and tools such as automation, monitoring, and rollback capabilities are essential.

By building Continuous Integration (CI) and Continuous Deployment (CD) pipelines, you can efficiently manage the process from development to deployment and apply changes quickly and safely.

This plays an important role in maintaining system availability and reliability.

Blue-green deployment and canary deployment are widely used as deployment strategies.

Blue-green deployment operates both the new and existing versions simultaneously, allowing for a quick switch back to the previous version if an issue occurs, ensuring high stability.

Canary deployment offers the advantage of enabling early detection and resolution of issues by gradually rolling out a new version to a small group of users.

These strategies must be designed considering the elasticity and scalability of the cloud environment.

In cloud environments, you must ensure system stability through resource allocation, load balancing, and disaster recovery plans.

Additionally, you must establish a deployment process that meets security and compliance requirements to ensure data protection and regulatory compliance.

By comprehensively considering these elements to design and execute cloud deployment, you can maximize operational efficiency and service quality.

Establishing Infrastructure and Application Deployment Strategies

After configuring infrastructure and Application deployment automation, you must review the deployment strategy.

The most commonly used deployment strategies are as follows.

| Deployment Strategy | Explanation |
| --- | --- |
| In-place deployment | Perform the update on the current servers. |
| Rolling deployment | Gradually deploy the new version across the existing server fleet. |
| Blue-green deployment | Gradually replace the existing servers with new servers. |
| Red-black deployment | Immediately switch from the existing servers to the new servers. |
| Immutable deployment | Build a completely new server set. |

Table. Deployment strategies

In-place Deployment

In-place deployment is a method of deploying a new Application version directly to the existing server set.

Since updates are performed via a single deployment operation, a certain amount of downtime may occur.

However, the deployment process itself proceeds relatively quickly because it requires minimal infrastructure changes and does not require modifying existing DNS records.

If deployment fails, redeployment is the only way to recover.

Rolling Deployment

Rolling deployment is a method of dividing a set of servers into multiple groups and performing updates sequentially.

Since there is no need to update all servers simultaneously, the old and new versions can coexist even during deployment.

This minimizes downtime and is considered a favorable strategy for achieving zero downtime.

Even if a new version deployment fails, only some servers are affected, so the overall service is not significantly impacted.

However, deployment time may be slightly longer than the in-place method.

Blue-Green Deployment

In the blue-green deployment strategy, ‘blue’ refers to the existing production environment where actual user traffic is routed, while ‘green’ refers to the parallel environment where the new code version is applied.

When deploying, switch user traffic from the blue environment to the green environment; if a problem occurs in the green environment, you can roll back by switching traffic back to the blue environment.

DNS cutover and Auto Scaling group policy are used as common methods to switch traffic in blue-green deployments.

Using an Auto-Scaling policy, you can gradually replace existing instances with VMs hosting the new version of the Application as the Application scales.

This method is particularly suitable for minor releases and small code changes.

Another approach is to perform switching between different versions of the Application using GSLB policies.

The diagram below illustrates the architecture for creating an environment to host the new version of the Application and using GSLB to divert a portion of traffic to the new environment.

Configure the GSLB policy with the Ratio algorithm and shift traffic gradually from the blue environment to the green environment by adjusting the weight ratio.

Configuration Diagram
Figure. Blue-Green Deployment
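
The gradual Ratio-based switch, including rollback on failure, can be sketched as follows; the weight dict and health-check callback are illustrative stand-ins for GSLB weights and real monitoring, not an actual API:

```python
def shift_traffic(weights, step=10, healthy=lambda: True):
    """Move one step of traffic from blue to green, or roll everything back
    to blue if the green environment reports unhealthy. `weights` holds
    percentages summing to 100; `healthy` stands in for real monitoring."""
    if not healthy():
        return {"blue": 100, "green": 0}  # immediate rollback to blue
    moved = min(step, weights["blue"])
    return {"blue": weights["blue"] - moved, "green": weights["green"] + moved}

w = {"blue": 90, "green": 10}
w = shift_traffic(w)  # green is healthy: shift another 10% of traffic
rolled_back = shift_traffic(w, healthy=lambda: False)  # fault detected
```

Repeating the healthy step until blue reaches 0% completes the cutover, at which point the blue environment can be terminated and its resources released.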

Red-Black Deployment

Red-Black deployment is similar to Blue-Green deployment, but differs in the DNS switching method.

Red-Black deployment switches to the new version by changing DNS at once, without a gradual transition.

Consequently, in blue-green deployment, the previous and new versions coexist for a certain period, whereas in red-black deployment, only one version is active.

This strategy is suitable when a rapid transition is required.

Immutable Deployment

Immutable deployment is a suitable strategy when there are unknown dependencies in the application or when the infrastructure is complex.

Applications that have been established for a long time and updated repeatedly become increasingly difficult to upgrade over time.

In immutable deployments, you terminate the existing VM and deploy a new VM while releasing a new version.

When designing infrastructure and Application deployment strategies, you must consider not only downtime but also cost.

Estimate costs based on the number of VMs to replace and deployment frequency, and select the most suitable deployment method by comprehensively considering budget and downtime.

To achieve high-quality deployments, Application testing must be performed at every stage, which requires considerable effort.

The CI/CD pipeline discussed above can help automate the testing process and improve the quality and frequency of feature releases.

Using Version Management

Automating infrastructure and Application deployment increases deployment speed and frequency.

While this is a desirable development in terms of business agility, it requires the operations team to be more meticulous in managing deployments.

Issues such as failures or data loss may occur, or you may need to track the history of past deployments. Therefore, it is important to establish a management system for this.

Version management contributes to reducing the risk of loss by enabling recovery to a normal state or previous data.

In Samsung Cloud Platform, you can configure version management at the following three levels.

Data Version Control Based on Object Storage

You can enable versioning for object storage in Samsung Cloud Platform’s Object Storage.

You can restore the previous version of an object even if you overwrite or delete it.

Object Storage versioning is similar to backup or synchronization in terms of preserving data, but it differs in how it preserves existing data.

Backups preserve data at a specific point in time, while synchronization reflects all data changes to the replication target but does not preserve existing data.

Object Storage versioning saves a new version whenever an object is modified by receiving POST, PUT, or DELETE requests, while preserving the existing version.
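
The versioning semantics above can be modeled with a toy in-memory bucket; this is an illustration of the behavior (every write preserved, deletes recorded as markers), not the Object Storage API:

```python
class VersionedBucket:
    """Toy model of object versioning: every PUT appends a version, DELETE
    appends a delete marker, and older versions stay restorable."""

    def __init__(self):
        self._versions = {}

    def put(self, key, data):
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        self._versions.setdefault(key, []).append(None)  # delete marker

    def get(self, key, version=-1):
        data = self._versions[key][version]
        if data is None:
            raise KeyError(key)  # requested version is a delete marker
        return data

bucket = VersionedBucket()
bucket.put("report.csv", b"v1")
bucket.put("report.csv", b"v2")   # overwrite: previous version preserved
bucket.delete("report.csv")       # object hidden, versions still restorable
```

Even after the delete, `bucket.get("report.csv", version=0)` still returns the first version, which is the restore capability that distinguishes versioning from synchronization.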

Git-based source code version control

You can perform configuration management of source code using Git tools.

You can manage development code versions using all types of Git commands and features, such as saving source code, viewing commit history, tracking changes, checking history, and receiving change notifications.

Developers fetch source code from Git tools and work on their local PCs. When the work is complete, they perform a commit to the repository.

When multiple developers collaborate, changes can be tracked through version control, as shown in the following figure.

Conceptual Diagram
Fig. Git-based source code management

Server Image Versioning

You can systematically manage server versions by managing Virtual Server images.

You can use this image to deploy a new Virtual Server at a specific point in time, or register it in a Launch Configuration to use it as an Auto-Scaling server.

Container images used in Kubernetes Engine are stored in Container Registry, and these images are used when deploying pods internally.

When an update occurs to the Application, you can register it as a new version and use it for deployment.

Existing versions can also be stored and managed in the Registry, and if necessary, you can view and restore images from previous versions.

Testing and Rollback Automation

Operational optimization is a core procedure that must be performed continuously.

Continuous effort is required to identify and improve changes, and achieving operational excellence in this regard is not a short‑term goal but a continuous journey.

Change is inevitable when maintaining workloads, and system changes occur for various reasons, such as applying security patches, upgrading software, and addressing compliance requirements.

You must design workloads to allow all system components to be updated regularly to maintain the latest state and reflect critical updates.

To avoid large impact, you must automate procedures to enable small changes.

All changes must be reversible, allowing the system to be restored even if issues arise.

Applying such small-scale changes makes testing easier and helps improve overall system stability.

Additionally, you must automate change management to reduce human errors and achieve operational excellence.

The following figure is an example of an environment configuration for testing a new version of the Application, demonstrating a case of A/B testing that provides two or more feature versions to different user groups.

Configuration Diagram
Figure. Canary deployment

You can perform user testing by configuring the Ratio algorithm in GSLB to send 90% of total requests to the existing Application (v1) and distribute 5% each to the newly developed v2 and v3.

This allows you to verify the stability of the Application and server infrastructure.

You can use the same method for full-scale deployment.
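
The 90/5/5 split above can be sketched as consistent hash-based version assignment, so that the same user always lands on the same version during the test; the `route` function and weight shape are illustrative assumptions, not a GSLB API:

```python
from zlib import crc32

def route(user_id, weights):
    """Assign a user to a version by hashing into ratio buckets (weights
    must sum to 100). Hashing keeps the assignment stable per user."""
    bucket = crc32(user_id.encode()) % 100
    upper = 0
    for version, weight in weights.items():
        upper += weight
        if bucket < upper:
            return version
    raise ValueError("weights must sum to 100")

weights = {"v1": 90, "v2": 5, "v3": 5}
assignments = [route(f"user-{i}", weights) for i in range(1000)]
```

Over many users the observed shares converge to the configured ratio, while each individual user keeps a consistent experience across requests.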

By selecting the Blue-Green deployment strategy, you can test the impact by sending a portion of user requests to the Green environment, which is called Canary analysis.

If there is an issue with the new version (v2), you can immediately switch traffic back to the existing version (v1) before it seriously affects users.

Test the green environment to see if it can handle the load while gradually increasing traffic.

Limit the blast radius by monitoring the green environment to detect issues and keeping the option to shift traffic back.

If all metrics are normal, terminate the blue environment (v1) and release resources.

Blue-Green deployment achieves zero downtime, provides easy rollback, and allows users to specify the deployment time if necessary.

Cloud Monitoring and Log Analysis

Cloud monitoring and log analysis are essential tools for understanding system status and troubleshooting.

Traditional monitoring simply tells you what is broken (e.g., CPU 100% usage), whereas modern cloud observability provides in-depth data that enables debugging of the root cause of failures.

This enables you to understand system complexity, identify root causes, and respond rapidly and accurately.

Monitoring and log analysis contribute to enhancing the efficiency, stability, and security of cloud operations.

By analyzing resource usage patterns, you can optimize costs, eliminate performance bottlenecks, and identify security threats early.

Additionally, by leveraging automation tools and AI to perform real-time log collection and analysis, you can achieve operational optimization more quickly and accurately.

This approach effectively manages the dynamic and complex nature of cloud environments and supports business continuity and innovation.

Monitoring and Alerting Configuration

Operational excellence is determined by the ability to respond to and recover from failures quickly through monitoring.

To do this, it is important to understand the operational status of the workload by assessing the impact of events and corresponding responses on the workload, and to determine the system’s operational status using event metrics and dashboards.

Samsung Cloud Platform’s Cloud Monitoring service collects usage status, change information, and logs for operating infrastructure resources, and if a configured threshold is exceeded, it generates an event and sends a notification.

Through these features, users can respond quickly to performance degradation or failures and predict when resource capacity expansion is required, ensuring a stable computing environment.

The instrumentation features provided by Cloud Monitoring are broadly divided into two categories: performance data collection and event processing, and log data collection and event processing.

![Conceptual Diagram](../img/img_p_Cloud_Monitoring.png "Figure. Cloud Monitoring data collection diagram")

Tracking system state is essential for understanding workload behavior.

The operations team detects component anomalies using system status monitoring and takes action accordingly.

It is generally easy to think that monitoring is limited only to the infrastructure layer, such as tracking server CPU and memory usage, but in reality, monitoring must be applied to all layers of the architecture, including the Application, network, and database.

The following table lists the services for which you can configure monitoring by connecting Cloud Monitoring.

| Category | Service Group | Service |
| --- | --- | --- |
| Performance monitoring | Compute | Virtual Server, GPU Server, Bare Metal Server, Multi-node GPU Cluster |
| Performance monitoring | Storage | Block Storage(VM), Block Storage(BM), File Storage, Object Storage |
| Performance monitoring | Database | EPAS, PostgreSQL, MariaDB, MySQL, MS SQL, CacheStore |
| Performance monitoring | Container | Kubernetes Engine, Container Registry |
| Performance monitoring | Networking | VPC, Load Balancer, VPN, Global CDN, Direct Connect, Cloud WAN |
| Performance monitoring | Data Analytics | Search Engine, Event Streams, Vertica |
| Log monitoring | Compute | Virtual Server, GPU Server, Bare Metal Server, Multi-node GPU Cluster |
| Log monitoring | Database | EPAS, PostgreSQL, MariaDB, MySQL, MS SQL, CacheStore |
| Log monitoring | Container | Kubernetes Engine |
| Log monitoring | Data Analytics | Search Engine, Event Streams, Vertica |

Table. Services supported by Cloud Monitoring

To configure notifications in Samsung Cloud Platform Cloud Monitoring, you must define events.

An event is a configuration that notifies the user when the performance metric of a monitoring target reaches a predefined threshold.

When configuring events, specify the monitoring target, performance item, measurement type/unit, risk level, and threshold.

You can also send notifications by specifying the recipients.
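
Conceptually, the event configuration above boils down to a threshold check that produces a notification payload; the field names below mirror the console settings but are illustrative, not the Cloud Monitoring schema:

```python
def evaluate_event(metric_value, event):
    """Return a notification payload if the metric crosses the event's
    threshold, else None (simplified sketch of threshold-based events)."""
    if metric_value >= event["threshold"]:
        return {
            "target": event["target"],
            "metric": event["metric"],
            "level": event["level"],
            "message": (f"{event['metric']} = {metric_value}{event['unit']} "
                        f"exceeds threshold {event['threshold']}{event['unit']}"),
            "notify": event["recipients"],
        }
    return None

event = {
    "target": "vm-web-01", "metric": "cpu_used", "unit": "%",
    "level": "critical", "threshold": 90, "recipients": ["ops@example.com"],
}
alert = evaluate_event(95, event)   # fires; evaluate_event(50, event) does not
```

A real event pipeline would also debounce repeated crossings so that a metric hovering around the threshold does not flood recipients with notifications.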

Log Recording and Collection

Log collection and analysis are performed for post-incident response when issues occur.

This is because the approach of analyzing the root cause provides the most rapid clue to solving the problem.

If you correctly identify the problem, you can find and apply an efficient solution.

Samsung Cloud Platform provides 1GB of storage space for log collection. If this limit is exceeded, logs are automatically deleted starting from the oldest.

To set up log collection in Cloud Monitoring, the log agent must be installed on the collection target.

Network Logging collects logs from Firewalls, Security Groups, and NATs and stores them in Object Storage, allowing you to analyze traffic to and from the VPC.

While Cloud Monitoring and Network Logging record system-related activities, Logging & Audit records cloud and user activities.

For example, if a user logs in to the Console and creates a Virtual Server, this activity is recorded in Logging & Audit.

If you configure a Trail, you can preserve these logs for the long term without time constraints.

Collecting data from various resources to make decisions and predict potential problems is not a prerequisite for operations, but it critically contributes to improving operational quality.

This helps predict potential future failures and prepares the team to respond appropriately.

Implement a mechanism to collect logs for job events, various activities across workloads, and infrastructure changes to create detailed activity tracking and maintain activity records for auditing purposes.

Large-scale organizations generate vast amounts of log data across numerous systems, and to gain insights from this data, a mechanism is required to collect and store logs and event data over a specific period.
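A minimal sketch of such a collection mechanism, assuming a simple JSON Lines activity record; the `record_activity` helper and its fields are hypothetical, not an actual Logging & Audit schema.

```python
import io
import json
from datetime import datetime, timezone

# Hypothetical audit-style activity record in JSON Lines format.
# Field names are assumptions chosen for the sketch.
def record_activity(stream, actor: str, action: str, resource: str) -> None:
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
    }
    stream.write(json.dumps(entry) + "\n")

# In practice the stream would be a file or object-storage upload;
# an in-memory buffer keeps the demo self-contained.
buf = io.StringIO()
record_activity(buf, "alice", "Login", "console")
record_activity(buf, "alice", "CreateVirtualServer", "vs-web-01")

lines = buf.getvalue().splitlines()
print(len(lines))                          # 2
print(json.loads(lines[1])["action"])      # CreateVirtualServer
```

One record per line keeps the format append-only and trivially splittable, which matters when log volume is large and storage is periodically rotated.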

Log Analysis and Improvement Activities

You can drive improvements by analyzing monitoring logs built using Samsung Cloud Platform tools and solutions.

Regular log analysis allows you to improve cloud operational efficiency, optimize costs, and enhance security.

In cloud environments, various resources are dynamically created and deleted, generating large amounts of log data. By effectively analyzing this log data, you can detect potential issues early and derive improvements to optimize operations.

First, you can optimize costs by identifying resource usage patterns through cloud log analysis.

For example, if the data reveals that unused resources remain active, or that costs are surging due to unexpected traffic increases, you can adjust Auto-Scaling policies accordingly or reduce costs through commitment-based pricing.
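As a hedged sketch of this kind of analysis, the snippet below flags resources whose average CPU usage stays under an assumed idle threshold, marking them as candidates for rightsizing or shutdown. The usage data and the threshold value are invented for illustration.

```python
# Invented per-resource CPU samples (%) over some analysis window.
usage_logs = {
    "vs-web-01": [72, 80, 68, 75],
    "vs-batch-02": [3, 2, 4, 1],     # looks idle
    "vs-db-01": [55, 60, 58, 57],
}

IDLE_THRESHOLD = 5.0  # assumed: avg CPU below this suggests an unused resource

def find_idle(logs: dict, threshold: float = IDLE_THRESHOLD) -> list:
    """Return resources whose average utilization is below the threshold."""
    return sorted(
        name for name, samples in logs.items()
        if sum(samples) / len(samples) < threshold
    )

print(find_idle(usage_logs))  # ['vs-batch-02']
```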

Next, it can also be used for performance improvement.

Log analysis in cloud environments plays a crucial role in monitoring and optimizing system performance.

By analyzing log data, you can identify Application response times, database query speeds, and network bandwidth usage. This allows you to pinpoint areas where performance degradation occurs, eliminate bottlenecks, or optimize resource allocation, thereby improving overall system performance.
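One way to sketch this, under the assumption that access logs have already been parsed into per-endpoint latency samples, is a nearest-rank 95th-percentile calculation; the sample data is invented.

```python
import math

# Invented per-endpoint response-time samples (ms) parsed from access logs.
latencies_ms = {
    "/api/orders": [120, 130, 110, 900, 125, 118, 122, 119, 121, 124],
    "/api/items":  [40, 42, 41, 39, 43, 40, 44, 41, 42, 40],
}

def p95(samples: list) -> float:
    """Nearest-rank 95th percentile: the value below which ~95% of samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered)) - 1  # 0-based index of the p95 sample
    return float(ordered[rank])

for endpoint, samples in latencies_ms.items():
    print(endpoint, p95(samples))
# The single 900 ms outlier makes /api/orders stand out as the bottleneck.
```

Tail percentiles such as p95 are more useful than averages here, because a small number of slow requests (the 900 ms outlier above) is exactly the degradation an average would hide.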

Log analysis is also essential for identifying and responding to security threats.

Trail logs contain user access and operation records, and you can identify abnormal network activities through VPC log analysis.

Continuously analyzing log data allows for the rapid detection of abnormal login attempts, data exfiltration attempts, and malicious activities. By identifying signs of anomalies early and taking appropriate response measures, you can minimize security threats.
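A simple brute-force-login heuristic over trail-style records might look like the following sketch; the event format and the failure limit are assumptions, not an actual Trail schema.

```python
from collections import Counter

# Invented trail-style login records; field names are assumptions.
login_events = [
    {"ip": "203.0.113.7", "result": "FAIL"},
    {"ip": "203.0.113.7", "result": "FAIL"},
    {"ip": "203.0.113.7", "result": "FAIL"},
    {"ip": "203.0.113.7", "result": "FAIL"},
    {"ip": "203.0.113.7", "result": "FAIL"},
    {"ip": "198.51.100.2", "result": "SUCCESS"},
]

FAIL_LIMIT = 5  # assumed: failures per window that warrant an alert

def suspicious_ips(events: list, limit: int = FAIL_LIMIT) -> list:
    """Return source IPs with at least `limit` failed logins in the window."""
    fails = Counter(e["ip"] for e in events if e["result"] == "FAIL")
    return sorted(ip for ip, n in fails.items() if n >= limit)

print(suspicious_ips(login_events))  # ['203.0.113.7']
```

In a real deployment the window, threshold, and response (alert, lockout, firewall rule) would be tuned to the environment; the sketch only shows the shape of the detection step.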

Short-term log analysis can be performed using existing tools, but specialized log analysis tools or artificial intelligence can be leveraged to gain better insights.

You can achieve operational optimization more quickly and accurately by performing log analysis using automation tools and artificial intelligence (AI).

Utilizing these log analysis tools enables real-time log collection and analysis, providing immediate alerts or automatically performing remediation tasks when issues occur.

Event Handling

Event Grade Definition

Business impact is evaluated through task structuring, identification of key tasks, and analysis of task interdependencies (see Reliability Design Principles for details). Based on this evaluation, you can define event severity for the identified key systems.

In Cloud Monitoring, you can configure events and classify them by severity.

Severity levels can be specified as Fatal (the highest level), Warning (the intermediate level), and Information (the lowest level), and event occurrence frequency can be visualized based on each severity.

Configuring events ensures that critical monitoring information is not missed.

For example, if you configure an event to trigger whenever an overload-related performance metric exceeds a specific threshold, the user receives a notification whenever there is a risk of overload during the operation of that resource.

Based on this, operators can proactively respond before issues arise.
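The three severity levels could map to a measured metric as in this sketch, where the CPU thresholds are assumed values for illustration, not Samsung Cloud Platform defaults.

```python
# Assumed thresholds mapping a CPU measurement to the three severity
# levels described above, ordered from highest to lowest.
THRESHOLDS = [
    ("Fatal", 95.0),        # highest level
    ("Warning", 80.0),      # intermediate level
    ("Information", 60.0),  # lowest level
]

def classify(cpu_percent: float):
    """Return the highest severity the measurement triggers, or None."""
    for severity, threshold in THRESHOLDS:  # checked highest-first
        if cpu_percent >= threshold:
            return severity
    return None

print(classify(97.0))  # Fatal
print(classify(85.0))  # Warning
print(classify(65.0))  # Information
print(classify(50.0))  # None: no event fires
```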

Event Management Process

If a failure occurs, or if a Fatal event of equivalent severity occurs (even if it does not lead to an actual failure), you must take prompt action.

To achieve this, the event management process must be defined in advance, enabling rapid identification of issues and the taking of appropriate response measures.

The following figure is an example of the failure management process described in the reliability design principles.

![Conceptual Diagram](../img/img_operation_10_1.png "Figure. Incident Management Process Example")

Event Response Automation

Perform response actions based on predefined processes for rapid event response and configure event response automation to reduce the time required from incident identification to response.

For example, let’s assume a DDoS attack is launched against the server by an external attacker, as shown in the figure below. The goal of the DDoS attack is to incapacitate the server by generating excessive traffic, preventing legitimate users from using the service.

In such cases, the best approach is to configure DDoS Protection to detect and defend against attacks.

Figure. Automating DDoS Attack Mitigation Using Auto-Scaling

An architectural approach to respond to these attacks is to configure Auto-Scaling on the Virtual Server and set a policy to increase the number of servers based on load.

This implements an automated response to ensure that legitimate users are not completely blocked from using the service.

Additionally, configure thresholds and alerts for metrics such as Network In or CPU usage in Cloud Monitoring to ensure that alerts are sent to administrators.

While the administrator is being notified of the attack and responding, the automated measures of Auto-Scaling maintain service continuity. The administrator can then quickly identify the attacker's IP and protect the service by configuring a firewall deny policy for that address.
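The combined response (automated scale-out plus a manual firewall deny rule) can be sketched as follows. The scaling rule, its limits, and the deny-list are illustrative only, not actual Auto-Scaling or firewall APIs.

```python
# Sketch of the two-pronged response: automation scales the fleet out to
# absorb load, then the operator blocks the attacking source IP.
def desired_servers(current: int, cpu_percent: float,
                    scale_out_at: float = 80.0, max_servers: int = 10) -> int:
    """Add one server whenever average CPU crosses the scale-out threshold
    (assumed policy: step scaling by 1, capped at max_servers)."""
    if cpu_percent >= scale_out_at and current < max_servers:
        return current + 1
    return current

firewall_deny = set()  # stand-in for a firewall deny-rule list

def block_ip(ip: str) -> None:
    """Operator action: add a deny rule for the attacking source IP."""
    firewall_deny.add(ip)

# Attack traffic drives CPU up; automation scales out first to keep
# legitimate users served...
servers = desired_servers(current=2, cpu_percent=92.0)
print(servers)  # 3

# ...then the administrator identifies and blocks the attacker.
block_ip("203.0.113.7")
print("203.0.113.7" in firewall_deny)  # True
```

The design choice mirrored here is that automation buys time (service stays available under load) while the slower, human-judged step (attributing and blocking the attacker) happens in parallel.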