The page has been translated by Gen AI.

Operational Planning

Identify and Define Operational Requirements

Operational excellence is the practice of continuously improving how applications are run so that they stay stable with minimal downtime, maximizing business value and system efficiency.

The operations team is responsible for managing the application’s infrastructure, security, and all software-related issues, ensuring that the application operates reliably.

Especially for enterprise applications, availability must be clearly defined through a Service Level Agreement (SLA), so the operations team must fully understand business requirements and be able to respond quickly to the various events that may arise.

First, identify which of the laws and regulatory requirements applicable to the organization’s industry and activities could affect operations.

It is advisable to check the compliance items of the security design principles (Security Design Principles > I. Security Requirements Analysis and Design Principles > 1. Compliance and Security Requirements).

Among various compliance requirements, the Information Security Management System (ISMS) is a certification that companies or organizations of a certain scale must obtain when providing services to end users or consumers, making it the most common and essential compliance requirement.

The table below summarizes the contents that correspond to the operational management area among ISMS control items.

When establishing operational excellence design principles, it is important to closely examine, in particular, the aspects of change management, performance management, and incident management among these items.

Category	Control items	Explanation
Information system acquisition and development security	Separate test and production environments	Development and test systems must, in principle, be separated from production systems to reduce the risk of unauthorized access to and modification of the production system.
Information system acquisition and development security	Source program management	Source code should be managed to allow access only to authorized users and should not be stored in the production environment.
Information system acquisition and development security	Production environment migration	When transferring a newly introduced, developed, or modified system to the production environment, a controlled procedure must be followed, and the executable code must be run according to testing and user acceptance procedures.
System and Service Operations Management	Change Management	Establish and implement procedures to manage all changes to information system assets, and analyze the impact of changes on system performance and security before they are made.
System and Service Operations Management	Performance and Incident Management	To ensure the availability of information systems, performance and capacity requirements must be defined, and the status must be continuously monitored. In the event of a failure, procedures such as detection, logging, analysis, recovery, and reporting must be established and managed to respond effectively.

Table. ISMS control items related to operational excellence

Among various management frameworks, IT Service Management (ITSM) is a process-based approach to effectively designing, building, delivering, operating, and managing IT services in line with business requirements. The Information Technology Infrastructure Library (ITIL) is a representative framework.

ITSM takes the perspective of the entire service lifecycle rather than a technology-centric view, and it can be described as the most common and essential operational management framework that organizations adopt to provide stable and efficient services to end users.

Category	Process	Explanation
Service transition	Change Management	Manage all changes to IT services and infrastructure according to controlled procedures to minimize service interruptions and business risks caused by changes.
Service transition	Release and Deployment Management	Plan and control the entire lifecycle of safely and successfully deploying and transferring approved changes to the production environment.
Service operation	Incident Management	When unexpected service interruptions or quality degradation (failures) occur, quickly detect, log, analyze, and restore the service to minimize business impact.
Service operation	Problem Management	Identify the root cause of failures and establish measures to prevent recurrence, thereby ensuring long-term stability.
Service operation	Service Level Management	Continuously monitor, measure, report, and perform improvement activities to ensure that the service level objectives defined in the SLA (such as availability, performance, etc.) are being met.
Service operation	Capacity Management	Secure and manage the IT resources and performance needed for current and future business requirements cost‑effectively, and monitor to prevent any performance degradation.

Table. Core ITSM processes related to operational excellence
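
As a rough illustration of the Service Level Management row, the sketch below converts an availability target into a downtime budget and checks measured downtime against it. The function names, the 30-day measurement period, and the 99.9% figure in the usage note are assumed example values and are not taken from this guide.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime budget (in minutes) implied by an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability (in percent) achieved for a given amount of downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


def meets_sla(downtime_minutes: float, availability_pct: float, period_days: int = 30) -> bool:
    """True when the measured downtime stays within the budget."""
    return downtime_minutes <= allowed_downtime_minutes(availability_pct, period_days)
```

Under these assumptions, a 99.9% target over 30 days leaves a budget of about 43.2 minutes, so 40 minutes of downtime would meet the target and 50 minutes would not.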

Separation of test environment and production environment

Development and test environments must be configured separately from production systems, with logical and physical separation at the VPC and account level.

Additionally, it is advisable to strictly restrict developers from directly accessing the production system through IAM (Identity and Access Management) policies.
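
The idea behind such a restriction can be sketched as a deny-by-default lookup. This is not actual IAM policy syntax for any platform; the role names, environment names, and the `is_access_allowed` helper are hypothetical and only illustrate the rule that developers have no direct path to production.

```python
# Hypothetical mapping of roles to the environments they may access directly.
ROLE_ENVIRONMENTS = {
    "developer": {"dev", "test"},
    "operator": {"test", "prod"},
}


def is_access_allowed(role: str, environment: str) -> bool:
    """Deny by default: access is granted only if explicitly listed."""
    return environment in ROLE_ENVIRONMENTS.get(role, set())
```

With this mapping, a developer can reach the development and test environments but not production, and an unknown role is denied everywhere.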

When the organization is small or lacks human resources, making it difficult to separate development and operations tasks, this can be compensated through peer reviews among staff, supervision by superiors, and review and approval procedures for changes.

Source program management

Access to the source program should be granted only to authorized developers, and regular backups are necessary to prepare for emergencies.

Additionally, the change history of the source program must be systematically recorded, and any changes must undergo a review and approval process.

Typically, source code is managed using version control tools such as Git and SVN, which provide features such as access permission settings, version control, and change history tracking.

Production environment migration

When moving a system that has completed development to the production environment, the migration must be performed by a dedicated person who is not the developer.

Before migrating, review the test completion status, the migration strategy, the rollback plan in case of issues, and other relevant factors. Doing so reduces the risk of the transition to production.

In a DevOps environment, the development and deployment of applications are automated, and there are cases where developers perform the deployment themselves.

In such cases, rather than granting deployment rights to all developers, it is more effective to delegate authority to specific individuals and control the process by setting up an approval procedure in the automated deployment workflow.
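
A minimal sketch of such an approval gate follows. The rosters, the `DeploymentRequest` type, and the rule that requesters cannot approve their own changes are illustrative assumptions; a real pipeline would enforce this through its own approval feature.

```python
from dataclasses import dataclass, field

# Hypothetical rosters: only these identities may deploy or approve.
AUTHORIZED_DEPLOYERS = {"release-manager-1", "release-manager-2"}
AUTHORIZED_APPROVERS = {"ops-lead", "security-lead"}


@dataclass
class DeploymentRequest:
    change_id: str
    requested_by: str
    approvals: set[str] = field(default_factory=set)


def approve(request: DeploymentRequest, approver: str) -> None:
    """Record an approval, rejecting unauthorized or self-approvals."""
    if approver not in AUTHORIZED_APPROVERS:
        raise PermissionError(f"{approver} is not an authorized approver")
    if approver == request.requested_by:
        raise PermissionError("requesters cannot approve their own change")
    request.approvals.add(approver)


def can_deploy(request: DeploymentRequest, deployer: str, required_approvals: int = 1) -> bool:
    """Allow deployment only by a delegated deployer with enough approvals."""
    return deployer in AUTHORIZED_DEPLOYERS and len(request.approvals) >= required_approvals
```

The workflow would call `can_deploy` immediately before the production step and stop if it returns `False`.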

Change Management

When asset changes such as operating system upgrades, commercial software installations, improvements to running applications, network configuration changes, or server specification changes are required, a clear procedure must be established and strictly followed.

Even in the Samsung Cloud Platform environment, systematic management of architecture, virtual server changes, image upgrades, and related matters is required.

To this end, use IaC (Infrastructure as Code) tools to template the changes, and thoroughly document each change.

Additionally, you should review the changes by comparing the before-and-after states to minimize any unexpected impact.

In particular, for large-scale changes, it is important to conduct impact analysis based on the significance of the change and to prepare recovery measures in advance (e.g., rolling back to a previous version of the IaC code) in case the change fails.

Through this, you can protect the system from unexpected issues that may arise during change operations and quickly restore it to normal operation.
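
The before-and-after comparison can be sketched as a simple diff of two state snapshots. The `diff_state` helper and the flat dictionary format are assumptions made for illustration; IaC tools typically provide their own plan or preview output for this purpose.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Compare two flat state snapshots and report what a change would do."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}
```

Reviewing the `changed` and `removed` entries before applying a change helps surface unintended modifications, and the `before` snapshot doubles as the rollback target.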

Additionally, in cloud environments, it is necessary to automate change tasks using automation tools and to monitor the implementation status of changes in real time through monitoring and alerting systems.

This can enhance the stability and efficiency of change operations.

Regular reviews and audits should also be conducted to continuously improve the quality of IaC code and processes and to ensure compliance with security and regulatory requirements.

Through this systematic approach, you can efficiently perform operational change management in cloud environments and maintain system stability and reliability.

Performance and Incident Management

To ensure the availability of an information system, procedures must be established that include criteria for identifying performance and capacity management targets, definition of performance and capacity requirements (thresholds), monitoring methods, recording and analysis of results, and response measures when thresholds are exceeded.

To manage the performance of servers, networks, databases, applications, etc., it is advisable to define performance management components such as CPU utilization, memory utilization, and disk usage, as well as performance metrics like response time and throughput, and to use tools that can monitor them in real time.
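
As a minimal sketch of threshold-based monitoring, the code below compares one metric sample against upper limits. The metric names and threshold values are assumed examples, not recommendations from this guide, and metrics where a low value is the problem (such as throughput) would need a separate lower-bound check.

```python
# Assumed example upper limits; real values come from the defined requirements.
THRESHOLDS = {
    "cpu_utilization_pct": 80.0,
    "memory_utilization_pct": 85.0,
    "disk_usage_pct": 90.0,
    "response_time_ms": 500.0,
}


def find_breaches(sample: dict[str, float],
                  thresholds: dict[str, float] = THRESHOLDS) -> dict[str, tuple[float, float]]:
    """Return {metric: (observed, limit)} for every metric above its limit."""
    return {
        name: (value, thresholds[name])
        for name, value in sample.items()
        if name in thresholds and value > thresholds[name]
    }
```

Any non-empty result would trigger the response measures defined for exceeded thresholds, and the result itself can be stored for later analysis.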

For fault detection and response, you must define fault types and severity levels, and clearly specify the reporting procedures, detection methods, and responsibilities and roles for response and recovery.

Additionally, it is necessary to establish a response system that includes customer notification procedures and an emergency contact system in the event of a failure.

The incident response history must be documented, including the incident occurrence date and time, severity, responsible staff and manager, incident details and cause, actions taken and recovery details, and measures to prevent recurrence, and it should be managed in the form of an incident handling report.
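
The report fields listed above can be captured in a simple record type, sketched below. The three severity levels and the `missing_fields` helper are hypothetical; the actual levels should follow the fault types and severity definitions the organization adopts.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    # Hypothetical levels for illustration only.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3


@dataclass
class IncidentReport:
    occurred_at: datetime
    severity: Severity
    responsible_staff: str
    manager: str
    details: str
    cause: str
    actions_taken: str
    recovery_details: str
    recurrence_prevention: str

    def missing_fields(self) -> list[str]:
        """Names of text fields that are still empty."""
        return [
            name for name, value in vars(self).items()
            if isinstance(value, str) and not value.strip()
        ]
```

A report can be treated as complete only when `missing_fields()` returns an empty list, which makes it harder to close an incident without recording its cause and the measures to prevent recurrence.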

In addition to the mandatory operational management activities required by these compliance requirements, resource operation management items in the cloud environment must also be reviewed.

The table below summarizes the management items for Managed Service as an example of cloud operations management items.

Item	Explanation
Billing/Report	Cloud monthly usage fee billing; CSP technical support level management; Credit and other billing deduction element management
Service support	Service request receipt; Service operation execution time (24h X 365d, 9h X 5d, etc.); Service processing result sharing
Resource Management	Cloud resource management (resource creation/modification/deletion); Data migration; Backup & recovery
OS operation	System update and patch management; Performance optimization; System configuration management
Incident response	Disruption notification (notification list management, notification transmission, notification method, notification time); Incident response time (SLA-based); Incident analysis report; Incident reproduction test
Technical Support	Issue response; Architecture planning/improvement support; Cost optimization support; Resource usage review
Security	System security assessment and remediation; Regular review of user and permission management; Security policy management and enforcement; Security announcements
Monitoring	Monitoring settings; Threshold application and management; Monitoring agent management
Report	Regular reports (monthly operation report, annual plan); Irregular reports (audit response, usage risk report, service interruption notice and work, etc.)

Table. Managed Service management items

Cloud Operations Organization Structure

To achieve true operational excellence in cloud environments, a collaborative, team‑oriented organizational culture that can effectively support advanced technology adoption is essential.

This is because the core goal of cloud operational excellence is to achieve both business agility and service reliability, two values once considered conflicting, simultaneously and in a balanced manner.

Traditional IT organizations were based on a structure in which development and operations teams were clearly separated. Under this structure, the development team prioritized rapid release of new features, while the operations team prioritized uninterrupted, fault-free stability.

These fundamental differences in objectives inevitably caused interdepartmental goal conflicts, and changes from the development team created bottlenecks during the operations team’s stability review stage, becoming the main cause of reduced business agility.

To address these chronic traditional operational problems and achieve the shared goals of speed and stability, the DevOps culture emerged.

A DevOps organization does not simply keep development (Dev) and operations (Ops) as distinct teams. It establishes a collaborative framework that aligns the goals of both groups and shares responsibility across the entire service lifecycle (planning, development, deployment, operation).

Based on this DevOps culture, the role of modern cloud operations organizations is fundamentally redefined.

The operations organization no longer remains in the traditional role of controlling and managing changes to ensure stability.

Instead, to support both business speed and stability, it builds and provides an automated CI/CD pipeline, Infrastructure as Code (IaC) templates, and a platform with built-in monitoring and security, enabling development teams to deploy quickly and safely on their own.

Category	Traditional IT operations organization	Cloud Operations Organization
Operational priority	Department-specific goal optimization (e.g., development focuses on features, operations on stability)	Shared business objectives (fast deployment and reliable service)
Role distinction	Clear distinction of technical areas (server, network, DB, security)	Emphasizes automation and efficiency; works across multiple technology domains using a diverse technology stack.
Main role	System Administrator, Network Administrator, Database Administrator, Security Administrator, etc.	SRE, DevOps Engineer, Cloud Architect, Security Manager, etc.
Operation method and process	Focuses on tasks such as system updates and maintenance; development and operations are separated.	Performs automated updates and management tasks; development and operations are linked or integrated.

Table. Traditional IT Operations Organization and Cloud Operations Organization

The operating model in a cloud environment can vary depending on how the workload is configured.

Cloud operations within the existing organizational structure

In the process of migrating existing IT infrastructure to the cloud, it is common to either shift using a Lift & Shift approach or move only certain workloads to the cloud while keeping the rest on-premises.

In such situations, when designing a cloud operations model, you can consider two main approaches: either adding a separate cloud operations team within the existing IT operations organization, or integrating cloud operations responsibilities into the roles of the existing organization.

The first method is to add a separate cloud operations organization.

This approach involves forming a specialized team tailored to the characteristics and requirements of the cloud environment, assigning them to handle the management, monitoring, security, and optimization of cloud infrastructure.

This approach enhances expertise in cloud environments and enables focused management of each environment by separating roles from the existing on‑premises operations organization.

However, this approach has the drawback that role duplication within the organization and communication costs may arise.

The second approach is to integrate cloud operations tasks into the existing organization’s roles.

This approach enables the existing IT operations organization to manage both cloud and on‑premises environments, establishing an integrated operations model between the two.

It also helps strengthen collaboration within the organization and maintain consistent policies and processes between cloud and on‑premises environments.

However, while efficient utilization of internal resources is possible, a lack of expertise in cloud environments can reduce operational efficiency.

Figure. Cloud operation within the existing organization

Infrastructure Operations After Cloud Migration

The roles of cloud infrastructure operations and development teams undergo significant changes when an organization’s primary workloads have been migrated to the cloud or newly built on a cloud platform.

Previously, there was an on-premises operations organization structured by function, but after the cloud migration, many personnel now manage cloud-based workloads, and the role of the operations organization is being reorganized to focus on cloud infrastructure.

In this structure, application-related tasks remain the responsibility of the development team, and the operating model is designed so that platform operations are restructured to align with the cloud environment.

The cloud infrastructure operations organization is designed with the characteristics of the cloud environment in mind, and operates focusing on cloud resource management, monitoring, security, and optimization tasks.

This organization leverages the cloud service provider (CSP) platform to manage infrastructure efficiently and streamlines repetitive tasks using automation tools and scripts.

It also provides flexibility to quickly respond to business requirements by leveraging the elasticity and scalability of the cloud environment.

The development team continues to handle tasks related to application development and optimizes the development process in cloud environments.

To this end, the development team designs a cloud-native architecture and builds microservice-based applications to fully leverage the advantages of the cloud environment.

Additionally, it supports fast and reliable software deployment through CI/CD pipelines and facilitates seamless integration between cloud infrastructure and applications.

Operating the DevOps system

If an organization decides to rebuild its core systems as cloud-based applications delivered through CI/CD, the operating model also shifts to an optimized cloud One Team structure.

In this case, the cloud operations team and the development team are merged into a DevOps team, and their role is redefined as a unit responsible for continuous integration and deployment processes.

Through this, collaboration between development and operations becomes tighter, enabling rapid deployment and stable operation in cloud environments simultaneously.

Additionally, operations may be integrated in a DevSecOps model to strengthen security.

This refers to a structure where security, development, and operations work together organically as a single team, supporting the integration of security elements into development and operational processes to build secure applications from the outset.

This integrated operating model maximizes efficiency and stability in cloud environments, establishing a foundation that enables the organization to effectively achieve its business objectives.

Optimized cloud One Team operations break down boundaries between teams and foster a culture of collaboration toward shared goals.

Through this, organizations can respond quickly and flexibly even in cloud environments, and lay the foundation for continuous innovation and maintaining competitiveness.