The page has been translated by Gen AI.

Evaluation and Improvement

Process Definition for Continuous Improvement

When creating a process for continuous improvement, the first thing to define is roles.

Designate which members have the authority to perform necessary tasks and secure visibility into the improvement flow.

The following figure is an example of the DevOps Pipeline for the improvement process.

Figure. DevOps Pipeline Improvement Process Example

The tasks to be performed at each step are as follows.

| Step | Explanation |
| --- | --- |
| Create project / Add user | Create a DevOps project on Samsung Cloud Platform and add users. |
| Role definition | Assign a responsible user for each task and grant the required permissions. |
| Commit changes | Commit new or updated infrastructure configuration templates or application code to the code repository. |
| Build and test changes | Build the modified code and deploy it to the test environment for testing. Recreate the required test environment based on the conditions described in earlier errors or improvement items, and perform testing. |
| Deploy code to the production environment | Deploy the tested application and infrastructure configuration to the production environment after they pass through staging. |
| Monitoring | Incorporate the improvements identified in the error and improvement phase into the monitoring metrics. Monitor the performance of the measured metrics. |
| Error, improvement items | Summarize the errors and improvement items identified during monitoring. |

Table. DevOps Pipeline Improvement Process
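
The flow in the table above can be sketched as a small stage runner. This is only an illustrative sketch, not part of any Samsung Cloud Platform API: the stage names and the `run_pipeline` function are hypothetical, and each stage is assumed to be a callable that returns `True` on success. A failed stage stops the run and is recorded as an item to feed back into the next improvement cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stage names, loosely following the steps in the table.
STAGES = ["commit", "build_and_test", "deploy_to_production", "monitoring"]


@dataclass
class PipelineResult:
    completed: List[str] = field(default_factory=list)  # stages that succeeded
    issues: List[str] = field(default_factory=list)     # errors / improvement items


def run_pipeline(actions: Dict[str, Callable[[], bool]]) -> PipelineResult:
    """Run each stage in order; stop at the first failure and record it."""
    result = PipelineResult()
    for stage in STAGES:
        if not actions[stage]():
            result.issues.append(f"{stage} failed")
            break
        result.completed.append(stage)
    return result
```

The recorded `issues` correspond to the "Error, improvement items" step: they are the input for the next round of commits and tests.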

When configuring a process for continuous improvement as shown above, you must consider the task elements listed in the table below.

| Improvement work item | Explanation |
| --- | --- |
| Number of stages | In CI/CD, the stages may include development, integration, system, user acceptance, and production. Some organizations also include development, alpha, beta, and release stages. |
| Test types for each stage | Each stage can run various types of tests, such as unit testing, integration testing, system testing, user acceptance testing (UAT), smoke testing, load testing, and A/B testing in the production stage. |
| Test sequence | Test cases can be executed in parallel or sequentially. |
| Monitoring and reporting | Monitor system faults and failures, and send notifications when a failure occurs. |
| Infrastructure provisioning | Define the infrastructure provisioning method for each stage. |
| Rollback | Define a rollback strategy that reverts to a previous version when necessary. |

Table. DevOps Pipeline Improvement Work Elements
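
Two of the elements above, test sequence and rollback, can be illustrated with a short sketch. The function names `run_tests` and `deploy_with_rollback` are hypothetical, tests are assumed to be callables returning `True` when they pass, and the version history is assumed to already contain at least one known-good version.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_tests(tests: Dict[str, Callable[[], bool]], parallel: bool) -> Dict[str, bool]:
    """Run test callables sequentially or in a thread pool; return name -> passed."""
    if not parallel:
        return {name: test() for name, test in tests.items()}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(test) for name, test in tests.items()}
        return {name: future.result() for name, future in futures.items()}


def deploy_with_rollback(history: List[str], new_version: str,
                         results: Dict[str, bool]) -> str:
    """Return the version that should be live after this attempt.

    Assumes `history` is non-empty and ends with the last known-good version.
    """
    if all(results.values()):
        history.append(new_version)  # promote the new version
        return new_version
    return history[-1]  # rollback: keep the previous version live
```

Parallel execution shortens feedback time but only suits tests that do not share state; tests that depend on each other should stay sequential.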

Perform Post-Event Analysis

When a failure occurs during system operation, you must learn from mistakes and identify the problem.

Ensure that the same incident does not recur, and prepare a solution in case it recurs.

One improvement strategy is to conduct Root Cause Analysis (RCA), which helps analyze the root cause of an issue and prevent its recurrence.

RCA analyzes a problem by working through the questions in the following five steps.

| Step | Question |
| --- | --- |
| Step 1: Problem Definition | What happened? What are the specific symptoms? |
| Step 2: Data Collection | What evidence proves that there is a problem? How long has the problem existed? What is the impact of the problem? |
| Step 3: Causal Factor Identification | What events escalated into the problem? Under what conditions did the problem occur? What other issues surround the core problem? |
| Step 4: Root Cause Exploration | Why do the causal factors exist? What is the real cause that triggers the problem? |
| Step 5: Propose and Implement Solutions | What should be done to prevent the problem from recurring? How can the solution be implemented? Who is responsible for it? What are the risks associated with implementing the solution? |

Table. RCA Problem Definition
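
One way to make sure that no step is skipped is to capture each analysis as a structured record. The sketch below is an assumption for illustration only: the `RcaRecord` class and its field names are hypothetical, with one field per step and a list of answers for each.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RcaRecord:
    # One list of answers per RCA step; field names are hypothetical.
    problem_definition: List[str] = field(default_factory=list)
    data_collection: List[str] = field(default_factory=list)
    causal_factors: List[str] = field(default_factory=list)
    root_cause: List[str] = field(default_factory=list)
    solutions: List[str] = field(default_factory=list)

    def missing_steps(self) -> List[str]:
        """Return the steps that have no answers yet, in step order."""
        return [name for name, answers in vars(self).items() if not answers]

    def is_complete(self) -> bool:
        """An analysis is complete when every step has at least one answer."""
        return not self.missing_steps()
```

A review meeting can then refuse to close an incident while `missing_steps()` is non-empty.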

Perform Knowledge Management

Documentation provides instructions for the tasks needed to resolve issues caused by external or internal events. In practice, operations teams sometimes put off documentation updates, and outdated manuals are left neglected.

If documentation is insufficient, operations depend on individuals, which increases the risk of errors. System operations must therefore be maintainable regardless of who performs them, and a process for documenting every aspect must be established.

New team members should be able to resolve similar issues quickly by referencing existing incident cases and solutions. To support this, automate documentation with scripts so that it is updated automatically whenever the system changes.
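
As a minimal sketch of script-based documentation, the following renders a runbook section from a system description and detects when the stored document no longer matches the system. The dictionary layout and the `render_runbook` and `is_stale` functions are hypothetical; a real setup would read the description from the actual configuration source.

```python
from typing import Dict


def render_runbook(system: Dict[str, str]) -> str:
    """Render a plain-text runbook section from a system description."""
    lines = [f"System: {system['name']}"]
    # Sort the remaining keys so the output is stable across runs.
    for key in sorted(k for k in system if k != "name"):
        lines.append(f"- {key}: {system[key]}")
    return "\n".join(lines)


def is_stale(current_doc: str, system: Dict[str, str]) -> bool:
    """True when the stored document differs from what the system now implies."""
    return current_doc != render_runbook(system)
```

Running `is_stale` as part of the deployment pipeline turns an outdated manual into a visible failure instead of a silent gap.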

The documentation should include the defined Service Level Agreements (SLAs) for Recovery Time Objective (RTO), Recovery Point Objective (RPO), latency, scalability, and performance.
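
Recording the targets as data, not only as prose, makes it possible to check an incident against them. The sketch below is illustrative: `SlaTargets` and `sla_violations` are hypothetical names, and only RTO and RPO are modeled.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class SlaTargets:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


def sla_violations(targets: SlaTargets, recovery_time: timedelta,
                   data_loss_window: timedelta) -> List[str]:
    """Return which objectives an incident exceeded ("RTO", "RPO", or both)."""
    violations = []
    if recovery_time > targets.rto:
        violations.append("RTO")
    if data_loss_window > targets.rpo:
        violations.append("RPO")
    return violations
```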

System administrators maintain documentation that covers system startup, shutdown, patching, and update steps. The operations team must also include system testing and verification results, along with event response procedures, in the documentation.

The operations team should automate the process of applying a change to the system, building it, and then adding comments to the documentation. Such comments can be used to automate tasks and are easy to read in code.

Since business priorities and customer requirements continually change and evolve over time, maintaining an operational environment to support them is a key success factor for operations.