Evaluation and repair
Evaluation and repair
Definition of the process for continuous improvement
When creating a process for continuous improvement, the first thing to define is the role.
It is necessary to designate which members have authority to perform the required tasks and to ensure visibility of the improvement flow.
The following figure is an example of a DevOps pipeline for the improvement process.
The tasks to be performed at each step are as follows.
| step | Explanation |
|---|---|
| Create project / Add user | Create a DevOps project on Samsung Cloud Platform and add users. |
| Role definition | Assign a responsible user based on the task and grant permissions. |
| Changes Commit | Commit new or updated infrastructure configuration templates or Application code to the code repository. |
| Build and test changes | Build and deploy the changed code in the test environment for testing Considering the environment presented in earlier errors or improvement items, reproduce the required test environment and perform testing. |
| Deploy code to the production environment | Deploy the tested application and infrastructure configuration to the production environment after staging. |
| Monitoring | Error: Incorporate the improvements identified in the improvement stage into the monitoring metrics and monitor the performance of the measurement indicators. |
| Error, improvement items | Summarize the errors and improvement items identified during the monitoring process. |
When configuring a process for continuous improvement as shown above, you need to consider work elements like those in the table below.
| Improvement work items | Explanation |
|---|---|
| Number of steps | In CI/CD, it may include development, integration, system, user acceptance, and production Some organizations also include development, alpha, beta, and release stages. |
| Test types for each stage | Each stage can perform various types of testing, such as unit testing, integration testing, system testing, UAT, smoke testing, load testing, and A/B testing, in the production phase. |
| Test order | Test cases can be executed in parallel or sequentially. |
| Monitoring and Reporting | Monitor system defects and failures, and send alerts when a failure occurs. |
| Infrastructure provisioning | Define the infrastructure provisioning method for each stage. |
| rollback | Define a rollback strategy that reverts to a previous version when necessary. |
Perform post-event analysis
When a failure occurs during system operation, you must learn from the mistake and identify the problem.
You must ensure that the same failure does not recur and prepare a solution in case the failure repeats.
One of the improvement measures is to conduct a root cause analysis called RCA (Root Cause Analysis), which helps analyze the underlying cause of an issue and prevent its recurrence.
RCA performs problem analysis through the following five-step questions.
| step | Question |
|---|---|
| Step 1 Problem Definition |
|
| Step 2 Data Collection |
|
| Step 3 Cause Element Exploration |
|
| Step 4 Root Cause Analysis |
|
| Step 5 Propose and Execute Solutions |
|
Perform knowledge management
The documentation provides the procedures for executing tasks to resolve issues caused by external or internal events. Occasionally, the operations team delays updating the documents, leaving outdated documentation neglected.
If the documentation is insufficient, work relies on people, increasing the risk of errors; therefore, system operation must always be kept separate from humans, and a process for documenting every aspect must be established.
To enable new team members to quickly resolve similar issues by referring to existing incident cases and solutions, documentation automation via scripts is required so that manuals are automatically updated when the system changes.
The documentation must include a service level agreement (SLA) defined for recovery time objective/recovery point objective (RTO/RPO), latency, scalability performance, and related aspects.
System administrators maintain documentation that includes the system start, stop, patch, and update phases, and the operations team must include system test and verification results in the documentation along with the event response procedures.
It is advisable for the operations team to automate the process of applying changes to the system, building, and then adding comments to the documentation; in this case, comments can be used to automate tasks and are easily readable in code.
Because business priorities and customer requirements continuously change and evolve over time, maintaining the operating environment to support them is a key success factor for operations.
