Overview
Service Overview
Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale, high-performance AI computation. It clusters multiple GPUs across two or more GPU-equipped bare metal servers and integrates conveniently with Samsung Cloud Platform's high-performance storage and networking services.
Provided Features
Multi-node GPU Cluster provides the following functions.
- Auto Provisioning and Management: Through the web-based Console, you can easily manage the standard GPU Bare Metal server model with 8 GPU cards, from provisioning through resource and cost management.
- Network Connection: Two or more Bare Metal Servers can be clustered over high-speed interconnects to operate multiple GPUs together, and configuring a GPU Direct RDMA (Remote Direct Memory Access) environment enables direct data I/O between GPU memories for high-speed AI/machine learning computation.
- Storage Connection: In addition to the OS disk, various storage options can be attached. High-performance SSD-based NAS File Storage connected over a high-speed network can be used, and Block Storage and Object Storage can be used in conjunction as well.
- Network Setting Management: The server's subnet/IP can be easily changed from its initially set values, and a NAT IP can be attached or released as needed.
- Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use the Cloud Monitoring service with Multi-node GPU Cluster, install the Agent; it is required for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.
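As an illustration, the kind of GPU utilization and memory metrics that Cloud Monitoring collects can also be inspected directly on a server with `nvidia-smi`. The sketch below parses sample CSV output from an `nvidia-smi --query-gpu` call; the sample values are hypothetical and the helper name is ours, not part of the service.

```python
import csv
import io

# Hypothetical sample output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits
SAMPLE = """\
0, 87, 65310
1, 92, 70122
"""

def gpu_metrics(text):
    """Parse nounits CSV rows into per-GPU metric dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        {"index": int(idx), "util_pct": int(util), "mem_used_mib": int(mem)}
        for idx, util, mem in rows
    ]
```

Running the real command on a provisioned server (with the GPU driver installed) produces rows in this shape, one per GPU card.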
Components
Multi-node GPU Cluster provides GPUs as Bare Metal Server types with standard images and server types, and includes NVSwitch and NVLink interconnects.
GPU (H100)
A GPU (Graphics Processing Unit) is specialized for parallel computation that processes large amounts of data quickly, enabling large-scale parallel workloads in fields such as artificial intelligence (AI) and data analysis.
The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.
| Classification | H100 Type |
|---|---|
| Product Provisioning Method | Bare Metal |
| GPU Architecture | NVIDIA Hopper |
| GPU Memory | 80 GB |
| GPU Transistors | 80 billion (TSMC 4N process) |
| GPU Tensor Performance (FP16) | 989.4 TFLOPS, 1,978.9 TFLOPS* |
| GPU Memory Bandwidth | 3,352 GB/s (HBM3) |
| GPU CUDA Cores | 16,896 cores |
| GPU Tensor Cores | 528 (4th generation) |
| NVLink Performance | NVLink 4 |
| Total NVLink Bandwidth | 900 GB/s |
| NVLink Signaling Rate | 25 Gbps (x18) |
| NVSwitch Performance | NVSwitch 3 |
| NVSwitch GPU Bandwidth | 900 GB/s |
| Total NVSwitch Aggregate Bandwidth | 7.2 TB/s |
\* With sparsity
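The NVLink figures above can be checked on a provisioned server with `nvidia-smi topo -m`, which prints a GPU-to-GPU connectivity matrix where entries such as `NV18` indicate the number of NVLink links between a pair of GPUs. A minimal parsing sketch, using an abridged hypothetical sample of that matrix:

```python
# Hypothetical, abridged sample of `nvidia-smi topo -m` output.
# On an 8-GPU H100 node with NVSwitch, every GPU pair typically
# shows NV18 (18 NVLink links).
SAMPLE_TOPO = """\
     GPU0  GPU1  GPU2
GPU0  X    NV18  NV18
GPU1 NV18   X    NV18
GPU2 NV18  NV18   X
"""

def nvlink_pairs(topo_text):
    """Return (src, dst, link_count) for each NVLink-connected GPU pair."""
    lines = [line.split() for line in topo_text.strip().splitlines()]
    gpus = lines[0]  # header row: GPU column labels
    pairs = []
    for row in lines[1:]:
        src, cells = row[0], row[1:]
        for dst, link in zip(gpus, cells):
            # Count each unordered pair once; skip the 'X' self entries.
            if link.startswith("NV") and src < dst:
                pairs.append((src, dst, int(link[2:])))
    return pairs
```

This is only an illustration of reading the topology matrix; the real command also reports PCIe and NUMA relationships that the sketch ignores.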
OS and GPU Driver Version
The operating systems (OS) supported by Multi-node GPU Cluster are as follows.
| OS | OS version | GPU driver version |
|---|---|---|
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |
Server Type
The server types provided by Multi-node GPU Cluster are as follows. For a detailed description, refer to Multi-node GPU Cluster server type.
g2c96h8_metal
| Classification | Example | Detailed Description |
|---|---|---|
| Server Generation | g2 | Server generation provided |
| CPU | c96 | Number of CPU cores |
| GPU | h8 | GPU type and quantity |
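The naming scheme above can be decoded mechanically. The parser below is a hypothetical illustration of that scheme for names like `g2c96h8_metal` (it is not an official API of the service):

```python
import re

# Hypothetical parser for the server type naming scheme described above:
# g<generation> c<cpu cores> <gpu type letter><gpu count> _metal
PATTERN = re.compile(
    r"^g(?P<gen>\d+)c(?P<cores>\d+)(?P<gpu>[a-z])(?P<count>\d+)_metal$"
)

def parse_server_type(name):
    """Split a server type name into its generation/CPU/GPU components."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unrecognized server type: {name}")
    return {
        "generation": int(m.group("gen")),
        "cpu_cores": int(m.group("cores")),
        "gpu_type": m.group("gpu"),      # 'h' corresponds to H100 here
        "gpu_count": int(m.group("count")),
    }
```

For example, `parse_server_type("g2c96h8_metal")` yields generation 2, 96 CPU cores, and 8 GPUs of type `h`.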
Preceding Service
This is a list of services that must be configured before creating this service. For details, refer to each service's guide and prepare them in advance.
| Service Category | Service | Detailed Description |
|---|---|---|
| Networking | VPC | A service that provides an independent virtual network in a cloud environment |