Large-Scale Data Processing Architecture
Overview
Samsung Cloud Platform provides Vertica (DBaaS), a data warehouse (DW) database service built on Vertica, a database dedicated to analytics. The service makes it easy to build and manage a Vertica cluster and provides a UI for managing cluster information and status.
Vertica uses a masterless, pure MPP (Massively Parallel Processing) architecture, which makes it suitable for parallel analysis of large-scale data, and it includes in-database machine learning and advanced analytics features. It can store general data as well as financial, health, monitoring, and event data, and users can quickly extract and analyze the information they need with its analysis features.
In the future, it will be possible to easily extract and analyze data from various data storage systems such as Object Storage or HDFS.
Architecture Diagram
- Users access various services (finance, hospital, events).
- To ensure service continuity, the RDB is configured for high availability (HA) and the NoSQL database is configured as a cluster; together they store the data that users need.
- The CacheStore cache service is used to provide content to users quickly and store session information to reduce processing time.
- The Event Streams message service is used to transmit data to the target storage in real-time or batch mode.
- The Elastic Stack is used to collect data from multiple systems (Logstash), analyze and search it (Search Engine), and present it through visualization (Kibana).
- Data stored in the RDB and NoSQL databases is transferred to Vertica (DBaaS) in batches and stored there.
- Various data stored in Vertica (DBaaS) is used for analysis.
- Data stored in Object Storage or HDFS can also be linked to Vertica (DBaaS) for data analysis.
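The batch transfer from RDB and NoSQL into Vertica (DBaaS) described above can be sketched as follows. This is a minimal illustration, not the platform's actual transfer mechanism: `iter_batches`, `transfer_in_batches`, and the `load_batch` callback are hypothetical names, and the callback stands in for whatever bulk-load step the target database provides.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence

Row = Sequence[object]


def iter_batches(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def transfer_in_batches(
    rows: Iterable[Row],
    load_batch: Callable[[List[Row]], None],
    batch_size: int = 1000,
) -> int:
    """Send rows to the target in batches and return the number of rows sent.

    load_batch is a hypothetical callback; in a real pipeline it would wrap
    the bulk-load operation of the analytics database.
    """
    total = 0
    for batch in iter_batches(rows, batch_size):
        load_batch(batch)
        total += len(batch)
    return total
```

Loading in fixed-size batches keeps memory use bounded on the source side and lets a failed batch be retried without resending everything.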
Use Cases
Building a Monitoring System
When dozens of servers are periodically checked and analyzed for signs of abnormality, an agent on each server collects and stores various information (server settings, system logs, software installation information, security information, etc.).
Pre-defined patterns of abnormal signs can then be matched against the collected data to identify problematic servers.
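As a rough illustration of matching pre-defined patterns against collected data, the sketch below scans log lines per server with regular expressions. The pattern names, the regular expressions, and the `find_problem_servers` function are all assumptions made for this example; a real deployment would define its own patterns and would more likely run the matching inside the analytics database.

```python
import re
from typing import Dict, Iterable, List, Pattern, Set, Tuple

# Hypothetical abnormal-sign patterns, keyed by a short name.
ABNORMAL_PATTERNS: Dict[str, Pattern[str]] = {
    "disk_full": re.compile(r"No space left on device"),
    "auth_failure": re.compile(r"Failed password for \S+"),
    "oom_kill": re.compile(r"Out of memory: Killed process \d+"),
}


def find_problem_servers(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each server to the sorted names of the patterns its logs matched.

    records is an iterable of (server_name, log_line) pairs. Servers with no
    matching line are left out of the result.
    """
    hits: Dict[str, Set[str]] = {}
    for server, line in records:
        for name, pattern in ABNORMAL_PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(server, set()).add(name)
    return {server: sorted(names) for server, names in hits.items()}
```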
Data-Centric Hospital
The medical records of patients from multiple contracted hospitals (disease name, occurrence location, general symptoms, special circumstances, treatment status, etc.) are stored in one place.
When a patient visits, a diagnosis can be made quickly based on symptoms, and treatment status and methods can be provided to the patient.
Prerequisites
Building the Vertica service requires the customer to bring their own Vertica license (BYOL).
Limitations
Vertica (DBaaS) is backed up with an initial full backup followed by incremental snapshots. Restore to a specific point in time is supported, but it is snapshot-based rather than transaction log-based.
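One way to read this limitation is that the available restore points are the times at which snapshots were taken, not arbitrary moments in between. Under that assumption, the sketch below picks the latest snapshot at or before a requested time. The function name `pick_restore_point` is hypothetical and does not reflect the service's actual restore API.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional


def pick_restore_point(
    snapshot_times: List[datetime], target: datetime
) -> Optional[datetime]:
    """Return the latest snapshot time at or before target.

    Returns None when target is earlier than every snapshot, meaning no
    restore point exists for that time.
    """
    ordered = sorted(snapshot_times)
    idx = bisect_right(ordered, target)
    return ordered[idx - 1] if idx else None
```

For example, with snapshots taken daily at midnight, a request to restore to noon on a given day would resolve to that day's midnight snapshot, so changes made after it would not be recovered.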
Considerations
Vertica licenses limit data capacity, so a sudden increase in data during service use may cause problems. Estimate the data capacity to be stored in advance and purchase a license that covers it.
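A simple way to approach this estimate is to project data growth and see when it would pass the licensed capacity. The sketch below assumes a constant compound monthly growth rate, which is a simplification; the function name and parameters are hypothetical, and the figures in the usage note are illustrative only.

```python
from typing import Optional


def months_until_license_exceeded(
    current_tb: float,
    monthly_growth_rate: float,
    licensed_tb: float,
    horizon_months: int = 120,
) -> Optional[int]:
    """Return the first month in which projected data exceeds licensed_tb.

    Month 0 is the present. Growth is compounded monthly at
    monthly_growth_rate (0.10 means 10% per month). Returns None if the
    limit is not exceeded within horizon_months.
    """
    if current_tb < 0 or licensed_tb <= 0 or monthly_growth_rate < 0:
        raise ValueError("sizes and growth rate must be non-negative")
    size = current_tb
    for month in range(horizon_months + 1):
        if size > licensed_tb:
            return month
        size *= 1 + monthly_growth_rate
    return None
```

With 5 TB stored today, 10% monthly growth, and a 10 TB license, `months_until_license_exceeded(5, 0.10, 10)` returns 8, so the license would need to be extended within about eight months.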
Related Services
This is a list of Samsung Cloud Platform services related to the features or configurations described in this guide. Please refer to it when selecting and designing services.
| Service Group | Service | Detailed Description |
|---|---|---|
| Database | PostgreSQL(DBaaS) | A service that easily creates and manages open-source PostgreSQL in a web environment |
| Database | MySQL(DBaaS) | A service that easily creates and manages MySQL, a small but powerful open-source relational database |
| Database | CacheStore | A key-value in-memory data store with fast data processing capabilities |
| Storage | Object Storage | Object storage that makes it convenient to store and retrieve data |
| Data Analytics | Event Streams | A service that creates and manages an Apache Kafka cluster |
| Data Analytics | Search Engine | A service that easily creates and manages Elasticsearch in a web environment |
