Data Ops-Based Workflow Creation and Management
Overview
Data Ops is a managed workflow orchestration service based on Apache Airflow that creates and automates workflows for periodic and repetitive data processing tasks.
It can be used independently in the Samsung Cloud Platform’s Kubernetes Engine cluster environment or with other application software.
Architecture Diagram
The System Manager applies for the Data Ops service to manage workflows for periodic and repetitive data processing tasks (extraction, loading, transformation, and refinement).
The Data Engineer can modify the settings of the Data Ops service and manage additional plugins and library files through Ops Manager.
The Data Ops service is based on Apache Airflow and allows writing, scheduling, and monitoring workflows in DAG (Directed Acyclic Graph) format.
- The worker that executes actual tasks runs dynamically.
It can perform workflow-based tasks in conjunction with various systems such as Data Flow, Cloud Hadoop, Legacy System, and Object Storage.
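To make the DAG idea concrete, the following is a minimal sketch of how a directed acyclic graph fixes the order in which tasks may run. It uses only the Python standard library (`graphlib`), not the Apache Airflow API, and the task names (`extract`, `transform`, `refine`, `load`) are hypothetical labels chosen to mirror the processing steps mentioned above.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks that must finish before it can start.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "refine": {"transform"},
    "load": {"refine"},
}


def execution_order(dag):
    """Return one valid run order for the tasks in the graph.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is why a workflow must be acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

A scheduler built on this idea only starts a task once everything it depends on has completed, and a circular dependency is rejected up front rather than discovered at run time.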
Use Cases
Data-Driven Workflow Orchestration
Data Ops can orchestrate data-driven workflows, especially ETL/ELT.
It automatically organizes, monitors, and executes workflows.
For example, tasks can be executed through Spark and their results stored in Cloud Hadoop.
Batch Workloads
It can serve as a pipeline that fetches data from multiple sources and transforms it, as in ETL pipelines or ELT tasks.
It can increase the visibility of batch processes and shorten the development cycle by separating batch tasks.
It is suitable for batch processing tasks that can tolerate delays between task executions.
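The batch pattern described here, fetching from multiple sources and keeping each stage as a separate task, can be sketched as follows. This is an illustrative outline in plain Python, not Data Ops or Airflow code; the source names (`orders_db`, `legacy_csv`), the record fields, and the in-memory `warehouse` list are all assumptions made for the example.

```python
def extract(sources: dict[str, list[dict]]) -> list[dict]:
    """Gather rows from every source, tagging each row with its origin."""
    rows = []
    for name, records in sources.items():
        for record in records:
            rows.append({**record, "source": name})
    return rows


def transform(rows: list[dict]) -> list[dict]:
    """Drop rows that have no amount and convert the rest to float."""
    return [
        {**row, "amount": float(row["amount"])}
        for row in rows
        if row.get("amount") is not None
    ]


def load(rows: list[dict], target: list) -> int:
    """Append rows to the target store and return how many were written."""
    target.extend(rows)
    return len(rows)


# Hypothetical input data standing in for two separate source systems.
sources = {
    "orders_db": [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}],
    "legacy_csv": [{"id": 3, "amount": "4"}],
}
warehouse: list[dict] = []
loaded = load(transform(extract(sources)), warehouse)
print(loaded, warehouse)
```

Keeping the stages as separate functions means each one can be run, retried, and monitored as its own task, which is what makes the batch process more visible.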
Enterprise Scheduling
By linking with command shells, APIs, and enterprise execution containers, Data Ops can schedule work together with existing application tools.
It can orchestrate data pipeline services by communicating with existing services.
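As a rough illustration of linking a workflow step to a command shell, the sketch below wraps an external command as a task using Python's standard `subprocess` module. The function name `run_command_task` and the sample command are hypothetical; in a real deployment the command would be whatever existing tool the workflow needs to call.

```python
import subprocess
import sys


def run_command_task(command: list[str], timeout_s: float = 30.0) -> str:
    """Run an external command and return its trimmed standard output.

    Raises subprocess.CalledProcessError on a non-zero exit code, so a
    failing command is reported as a failing task instead of being ignored.
    """
    result = subprocess.run(
        command,
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout_s,
    )
    return result.stdout.strip()


# The Python interpreter stands in for an existing command-line tool.
output = run_command_task([sys.executable, "-c", "print('batch job finished')"])
print(output)
```

Surfacing the exit code as an exception is the design choice that matters: it lets the orchestrator decide whether to retry, alert, or stop downstream tasks.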
Pre-requisites
None
Limitations
None
Considerations
To use Data Ops, an Ingress Controller must exist within the cluster.
Related Services
This is a list of Samsung Cloud Platform services related to the features or configurations described in this guide. Refer to it when selecting and designing services.
| Service Group | Service | Detailed Description |
|---|---|---|
| Container | Kubernetes Engine | Kubernetes container orchestration service |
| Storage | File Storage | Storage that allows multiple client servers to share files through network connections |
| Storage | Object Storage | Object storage that is convenient for data storage and retrieval |
| Networking | VPC | Service that provides an independent virtual network in a cloud environment |
| Networking | Security Group | Virtual firewall that controls VM traffic |
| Networking | Load Balancer | Service that automatically distributes server traffic loads |
| Data Analytics | Data Flow | Service that automates data processing flows by extracting/transforming/transferring data from various sources |
