The page has been translated by Gen AI.

Evaluation and repair

Definition of the process for continuous improvement

When creating a process for continuous improvement, the first thing to define is the role.

It is necessary to designate which members have authority to perform the required tasks and to ensure visibility of the improvement flow.

The following figure is an example of a DevOps pipeline for the improvement process.

flowchart — Figure. Example of DevOps pipeline improvement process

The tasks to be performed at each step are as follows.

step	Explanation
Create project / Add user	Create a DevOps project on Samsung Cloud Platform and add users.
Role definition	Assign a responsible user based on the task and grant permissions.
Changes Commit	Commit new or updated infrastructure configuration templates or Application code to the code repository.
Build and test changes	Build and deploy the changed code in the test environment for testing Considering the environment presented in earlier errors or improvement items, reproduce the required test environment and perform testing.
Deploy code to the production environment	Deploy the tested application and infrastructure configuration to the production environment after staging.
Monitoring	Error: Incorporate the improvements identified in the improvement stage into the monitoring metrics and monitor the performance of the measurement indicators.
Error, improvement items	Summarize the errors and improvement items identified during the monitoring process.

Table. DevOps pipeline improvement process

When configuring a process for continuous improvement as shown above, you need to consider work elements like those in the table below.

Improvement work items	Explanation
Number of steps	In CI/CD, it may include development, integration, system, user acceptance, and production Some organizations also include development, alpha, beta, and release stages.
Test types for each stage	Each stage can perform various types of testing, such as unit testing, integration testing, system testing, UAT, smoke testing, load testing, and A/B testing, in the production phase.
Test order	Test cases can be executed in parallel or sequentially.
Monitoring and Reporting	Monitor system defects and failures, and send alerts when a failure occurs.
Infrastructure provisioning	Define the infrastructure provisioning method for each stage.
rollback	Define a rollback strategy that reverts to a previous version when necessary.

Table. DevOps pipeline improvement work items

Perform post-event analysis

When a failure occurs during system operation, you must learn from the mistake and identify the problem.

You must ensure that the same failure does not recur and prepare a solution in case the failure repeats.

One of the improvement measures is to conduct a root cause analysis called RCA (Root Cause Analysis), which helps analyze the underlying cause of an issue and prevent its recurrence.

RCA performs problem analysis through the following five-step questions.

step	Question
Step 1 Problem Definition	What happened? What are the specific symptoms?
Step 2 Data Collection	What evidence proves that there is a problem? How long has the problem existed? What is the impact of that problem?
Step 3 Cause Element Exploration	What events developed into problems? Under what conditions did the problem occur? What other issues surround the core problem?
Step 4 Root Cause Analysis	Why do the cause elements exist? What is the real cause of that problem?
Step 5 Propose and Execute Solutions	What should be done to prevent the problem from recurring? How can the solution be implemented? Who is responsible for it? What are the risks associated with implementing the solution?

Table. RCA problem definition

Perform knowledge management

The documentation provides the procedures for executing tasks to resolve issues caused by external or internal events. Occasionally, the operations team delays updating the documents, leaving outdated documentation neglected.

If the documentation is insufficient, work relies on people, increasing the risk of errors; therefore, system operation must always be kept separate from humans, and a process for documenting every aspect must be established.

To enable new team members to quickly resolve similar issues by referring to existing incident cases and solutions, documentation automation via scripts is required so that manuals are automatically updated when the system changes.

The documentation must include a service level agreement (SLA) defined for recovery time objective/recovery point objective (RTO/RPO), latency, scalability performance, and related aspects.

System administrators maintain documentation that includes the system start, stop, patch, and update phases, and the operations team must include system test and verification results in the documentation along with the event response procedures.

It is advisable for the operations team to automate the process of applying changes to the system, building, and then adding comments to the documentation; in this case, comments can be used to automate tasks and are easily readable in code.

Because business priorities and customer requirements continuously change and evolve over time, maintaining the operating environment to support them is a key success factor for operations.