Data Storage Design
Storage Selection
Storage is one of the key elements that affects application performance.
All software applications interact with storage for installation, logging, and file access.
The optimal storage solution may vary depending on the following factors.
| Factor | Considerations |
|---|---|
| Access Pattern | Sequential access, random access |
| Access Frequency | Online (Hot), offline (Warm), archive (Cold) |
| Update Frequency | High update frequency (operating system, database volume), low update frequency (file storage, etc.) |
| Access Availability | Single instance connection, shared connection |
The storage options provided by Samsung Cloud Platform are as follows.
| Category | Block Storage | File Storage | Object Storage | Archive Storage |
|---|---|---|---|---|
| Function | Data is stored in fixed-size blocks and directly assigned to servers for high-availability storage service | Provides file-level storage for heterogeneous clients over the network | Allows users to store and use data on the internet as an object storage service | Storage service suitable for long-term storage of large amounts of data |
| Access Configuration | VM direct connection / Multi-Attach | NFS / CIFS | REST API (S3 compatible) | Connected to Object Storage |
| Access Control | - | Public IP / Server / VPC Endpoint | Public IP / Server / VPC Endpoint | Project public / private access setting |
| Encryption | KMS encryption volume selection | AES256 encryption applied by default | AES256 encryption optional | AES256 encryption optional |
| Data Protection | VM snapshot | Snapshot | Version management | - |
| Disk | SSD | SSD / HDD | - | - |
| Capacity | OS volume: 16 GB-12,288 GB; additional volumes: 8 GB-12,288 GB each, up to 23 additional volumes | No limit | No limit | No limit |
| Purpose | Operating system, database, and other high-throughput data storage | Web content management, entertainment data processing, container storage, big data analysis | Web content, log, and other object storage | Long-term preservation of large amounts of data |
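Since Object Storage is accessed through an S3-compatible REST API, each object maps to an HTTP verb against a URL built from the endpoint, bucket, and object key. The sketch below constructs (but does not send) such a request; the endpoint, bucket, and key are hypothetical placeholders, and a real client would also sign the request with service credentials.

```python
import urllib.request

# Hypothetical endpoint and bucket; actual values come from the
# Object Storage service configuration.
ENDPOINT = "https://objectstorage.example.com"
BUCKET = "my-bucket"
KEY = "logs/2024/app.log"

# An object write is a PUT against endpoint/bucket/key.
url = f"{ENDPOINT}/{BUCKET}/{KEY}"
request = urllib.request.Request(
    url,
    data=b"log line\n",  # object payload
    method="PUT",
    headers={"Content-Type": "text/plain"},
)
# The request is only constructed here; sending it requires credentials
# and request signing, which an S3-compatible SDK normally handles.
```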
Database Selection
In general, organizations standardize on common database platforms to increase management efficiency.
The appropriate database should be selected based on data requirements, and incorrect selection can lead to increased system latency and performance degradation.
Database selection varies depending on factors such as availability, scalability, data structure, throughput, and durability, which are required by the application.
When selecting a database, access patterns have a significant impact on the choice of technology, so it is desirable to optimize the database around them.
Most databases provide configuration options for workload optimization, and operational aspects such as memory, cache, storage optimization, scalability, backup, recovery, and maintenance can be reviewed together.
In this document, we will explore various features to meet the database requirements of applications.
- OLTP (Online Transaction Processing)
Most traditional relational databases are designed for online transaction processing (OLTP).
Samsung Cloud Platform provides managed database services for relational databases, including EPAS, MySQL, MariaDB, PostgreSQL, and Microsoft SQL Server.
Relational databases are suitable for applications that process complex business transactions, such as finance and e-commerce, and are advantageous for data aggregation and complex query processing.
Considerations for optimizing relational databases include:
- Selecting server types, including computing, memory, storage, and networking
- Configuring storage volumes
- Selecting the appropriate database engine
- Database options such as schema, index, and view
Relational databases can increase throughput through vertical scaling and can also scale horizontally for read operations using replicas.
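Among the options above, indexing is one of the most direct levers for OLTP workloads. The following minimal sketch uses SQLite as a stand-in for any relational engine to show an index changing the query plan from a full scan to an index search; the table and column names are illustrative.

```python
import sqlite3

# In-memory database as a stand-in for a managed relational service.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Without an index, filtering by customer_id scans the whole table.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"
).fetchall()
# The plan detail now reports an index search instead of a full scan.
print(plan[0][-1])
```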
- OLAP (Online Analytical Processing)
For analyzing large amounts of structured data, a data warehouse platform can be used, and Samsung Cloud Platform provides a column-based high-performance MPP analysis environment through Vertica (DBaaS).
Modern data warehouse technologies adopt columnar formats and use MPP (Massively Parallel Processing) to improve data analysis speed.
With a columnar format, aggregating data from a single column does not require scanning the entire table.
This reduces the amount of data scanned, resulting in improved query performance compared to row formats. MPP distributes data across worker nodes, while a leader node coordinates query execution.
The leader node distributes a query to the worker nodes based on the partition key.
Each worker node then executes its portion of the query in parallel.
Finally, the leader node collects the partial results from each worker node and returns the aggregated result.
Through this parallel processing, queries complete faster and large amounts of data can be processed more quickly.
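The columnar and MPP ideas can be illustrated with a toy sketch (plain Python lists standing in for column storage and worker nodes): summing one column never touches the others, and the leader combines partial sums from each partition.

```python
# Columns are stored independently, so aggregating "amount"
# does not read "region" at all.
columns = {
    "region": ["KR", "US", "KR", "EU"],
    "amount": [100, 250, 75, 300],
}

# The "leader" splits the amount column across two worker partitions...
partitions = [columns["amount"][:2], columns["amount"][2:]]
# ...each "worker" aggregates its partition (in parallel on real MPP)...
partial_sums = [sum(p) for p in partitions]
# ...and the leader combines the partial results.
total = sum(partial_sums)
print(total)  # 725
```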
- NoSQL
In various applications such as social media, the Internet of Things, clickstream data, and logs, a large amount of unstructured and semi-structured data is generated.
This data has a dynamic schema, and each record can have a different structure.
Storing this data in a relational database can be inefficient.
Relational databases must store data based on a fixed schema, so unnecessary null values may be stored, or data loss may occur.
Unstructured or NoSQL databases can store data flexibly without being bound by a fixed schema.
Records with different numbers of columns can be stored in the same table.
NoSQL databases can store large amounts of data and provide low latency.
Additionally, nodes can easily be added as needed, and horizontal scaling is supported by default.
However, since NoSQL databases do not support complex queries such as table and entity joins, using a relational database is more suitable in such cases.
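The dynamic-schema property can be sketched with plain dicts standing in for NoSQL documents (the field names are hypothetical): each record carries only the fields it actually has, with no NULL padding for the union of all columns.

```python
# Heterogeneous records coexist in one "table" (a list of dicts).
events = [
    {"type": "click", "url": "/home", "user": "a"},
    {"type": "sensor", "device_id": 7, "temp_c": 21.5},
    {"type": "log", "level": "warn", "msg": "disk 80% full"},
]

# A fixed relational schema would need the union of all fields,
# mostly NULL for each row; here each record stores only its own.
field_sets = [set(e) for e in events]
assert field_sets[0] != field_sets[1]  # different structures, same store
```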
On the Samsung Cloud Platform, CacheStore can be used as an in-memory database based on Redis, which can be used for high-performance database caching or application state storage.
- Data Search
There are cases where a large amount of data needs to be searched quickly to solve problems or gain business insights.
Searching application data helps access detailed information and analyze it from various perspectives.
To search data with low latency and high throughput, search engine technology must be used.
The Samsung Cloud Platform provides a Search Engine service.
The Search Engine service automates the creation and setup of Elasticsearch for data analysis.
The Search Engine can be deployed on a VM, and its availability and performance can be improved through cluster and replica configuration.
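Searches against such an engine are typically expressed as a JSON query body sent to a search endpoint. The sketch below builds an Elasticsearch-style bool query (the field names are hypothetical) combining a full-text match with a time-range filter.

```python
import json

# Query body for a hypothetical log index: full-text match on
# "message", filtered to the last hour, returning up to 20 hits.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"message": "timeout"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
    "size": 20,
}
# Serialized form that would be POSTed to the index's _search endpoint.
payload = json.dumps(query)
```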
Database Performance Improvement
DB Optimization
Database performance improvement refers to designing and operating a database so that it maintains its performance for as long as possible. What matters is the efficiency of the business as a whole, such as response speed and throughput per unit time, rather than server performance management alone.
As the business continues to change, the number of concurrent users increases, and the amount of data continues to grow, database performance deteriorates.
Generally, database performance is defined as the response time to user requests.
Optimal database performance means achieving the best performance with the minimum resources.
Factors that deteriorate database performance can occur from the initial analysis and design stages to the development and operation stages.
| Stage | Optimization Section | Content |
|---|---|---|
| Analysis | Business Process Optimization | Remove inefficient elements, perform process optimization that fits the business vision and strategy |
| Analysis | Architecture | Set the direction of the architecture considering transaction throughput, performance, data growth trend, security, and availability |
| Design | Physical Design | Perform design considering response time, distributed DB environment, number of concurrent users, data size, parallel processing, and distribution, concentration, and redundancy |
| Design | Application Design | Design to achieve optimal performance in conjunction with the DB, access path, data request type, and index |
| Development | SQL | Improve developer skills and develop standards to comply with performance policies |
| Operation | OS Tuning | Perform tuning for CPU, memory, disk I/O, etc. |
| Operation | Network Tuning | Perform tuning according to the amount of data, files, etc. transferred |
| Operation | DB Tuning | Perform tuning for data architecture, parameters, log files, etc. |
| Operation | Application Tuning | Continuously monitor the operating system and perform tuning by reflecting SQL, index policy, cluster policy, etc. for applications with poor performance |
Caching Implementation
Caching is the process of temporarily storing data or files in an intermediate location between the client and permanent storage to serve future requests more quickly and reduce network load.
Caching can improve application speed and reduce costs by reusing previously searched data. The following content shows the caching mechanism at each level.
| Level | Target | Caching Implementation |
|---|---|---|
| Web Layer | Web Content | Improve web server content transmission delay → Use Global CDN for content transmission |
| Application Layer | User Session Data | Use key/value storage and local cache to improve application performance and data access performance → Use CacheStore for state management |
| Database Layer | Data | Use database buffer and key/value storage to reduce latency when requesting database queries → Implement data caching using CacheStore, and offload read load using replica configuration |
The performance efficiency of the web layer is mainly related to the transmission of static content such as images, videos, and HTML pages.
This static content can be provided from a location closer to the user, reducing latency and allowing for faster response.
Using Global CDN for caching allows content to be transmitted from a location closer to the user, providing a better user experience.
By applying caching to the application layer, the results of complex repeated requests can be stored, reducing business logic calculations and database access. Furthermore, implementing a state management database to separate state storage from the application server allows you to improve service performance while avoiding session loss or concentration when scaling servers horizontally.
In general, the speed and throughput of the entire service depend on the performance of the database.
For services that use relational databases, write capacity cannot easily be increased by scaling servers horizontally, and vertical scaling has limits, so considerable effort is required for performance management.
Applying caching to the database can greatly increase database throughput and reduce data search wait times.
Placing a Redis-based CacheStore in front of the database or configuring a replica of the Database service to distribute read loads is also an effective strategy for improving performance.
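Offloading reads to replicas usually requires a small routing layer that sends writes to the primary and distributes reads across replicas. The sketch below shows one simple policy, round-robin read routing based on the statement type; the connection names are hypothetical placeholders for real database connections.

```python
import itertools

class Router:
    """Routes SQL statements: writes to the primary, reads round-robin
    across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # SELECTs can be offloaded to replicas; everything else goes
        # to the primary, which remains the single source of truth.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary

router = Router("primary", ["replica-1", "replica-2"])
targets = [router.route(q) for q in (
    "SELECT * FROM orders",
    "INSERT INTO orders VALUES (1)",
    "SELECT 1",
)]
print(targets)  # ['replica-1', 'primary', 'replica-2']
```

Note that replicas lag the primary slightly, so reads that must observe a just-committed write should still be directed to the primary.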