The page has been translated by Gen AI.

Evaluation and repair

Evaluation and repair

Definition of the process for continuous improvement

When creating a process for continuous improvement, the first thing to define is the role.

It is necessary to designate which members have authority to perform the required tasks and to ensure visibility of the improvement flow.

The following figure is an example of a DevOps pipeline for the improvement process.

flowchart
Figure. Example of DevOps pipeline improvement process

The tasks to be performed at each step are as follows.

stepExplanation
Create project / Add userCreate a DevOps project on Samsung Cloud Platform and add users.
Role definitionAssign a responsible user based on the task and grant permissions.
Changes CommitCommit new or updated infrastructure configuration templates or Application code to the code repository.
Build and test changesBuild and deploy the changed code in the test environment for testing
Considering the environment presented in earlier errors or improvement items, reproduce the required test environment and perform testing.
Deploy code to the production environmentDeploy the tested application and infrastructure configuration to the production environment after staging.
MonitoringError: Incorporate the improvements identified in the improvement stage into the monitoring metrics
and monitor the performance of the measurement indicators.
Error, improvement itemsSummarize the errors and improvement items identified during the monitoring process.
Table. DevOps pipeline improvement process

When configuring a process for continuous improvement as shown above, you need to consider work elements like those in the table below.

Improvement work itemsExplanation
Number of stepsIn CI/CD, it may include development, integration, system, user acceptance, and production
Some organizations also include development, alpha, beta, and release stages.
Test types for each stageEach stage can perform various types of testing, such as unit testing, integration testing, system testing, UAT, smoke testing, load testing, and A/B testing, in the production phase.
Test orderTest cases can be executed in parallel or sequentially.
Monitoring and ReportingMonitor system defects and failures, and send alerts when a failure occurs.
Infrastructure provisioningDefine the infrastructure provisioning method for each stage.
rollbackDefine a rollback strategy that reverts to a previous version when necessary.
Table. DevOps pipeline improvement work items

Perform post-event analysis

When a failure occurs during system operation, you must learn from the mistake and identify the problem.

You must ensure that the same failure does not recur and prepare a solution in case the failure repeats.

One of the improvement measures is to conduct a root cause analysis called RCA (Root Cause Analysis), which helps analyze the underlying cause of an issue and prevent its recurrence.

RCA performs problem analysis through the following five-step questions.

stepQuestion
Step 1
Problem Definition
  • What happened?
  • What are the specific symptoms?
Step 2
Data Collection
  • What evidence proves that there is a problem?
  • How long has the problem existed?
  • What is the impact of that problem?
Step 3
Cause Element Exploration
  • What events developed into problems?
  • Under what conditions did the problem occur?
  • What other issues surround the core problem?
Step 4
Root Cause Analysis
  • Why do the cause elements exist?
  • What is the real cause of that problem?
Step 5
Propose and Execute Solutions
  • What should be done to prevent the problem from recurring?
  • How can the solution be implemented?
  • Who is responsible for it?
  • What are the risks associated with implementing the solution?
Table. RCA problem definition

Perform knowledge management

The documentation provides the procedures for executing tasks to resolve issues caused by external or internal events. Occasionally, the operations team delays updating the documents, leaving outdated documentation neglected.

If the documentation is insufficient, work relies on people, increasing the risk of errors; therefore, system operation must always be kept separate from humans, and a process for documenting every aspect must be established.

To enable new team members to quickly resolve similar issues by referring to existing incident cases and solutions, documentation automation via scripts is required so that manuals are automatically updated when the system changes.

The documentation must include a service level agreement (SLA) defined for recovery time objective/recovery point objective (RTO/RPO), latency, scalability performance, and related aspects.

System administrators maintain documentation that includes the system start, stop, patch, and update phases, and the operations team must include system test and verification results in the documentation along with the event response procedures.

It is advisable for the operations team to automate the process of applying changes to the system, building, and then adding comments to the documentation; in this case, comments can be used to automate tasks and are easily readable in code.

Because business priorities and customer requirements continuously change and evolve over time, maintaining the operating environment to support them is a key success factor for operations.