Overview
Service Overview
Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale high-performance AI computation. You can use two or more Bare Metal Servers equipped with GPUs to cluster multiple GPUs, and conveniently operate GPU servers in conjunction with Samsung Cloud Platform’s high‑performance storage and networking services.
Provided Features
The Multi-node GPU Cluster provides the following features.
- Auto Provisioning and Management: Through the web-based Console, you can easily provision servers of the standard GPU Bare Metal model equipped with 8 GPUs and manage resources and costs.
- Network Connection: You can cluster multiple GPUs on two or more Bare Metal Servers via high‑speed interconnects, and by configuring a GPU Direct RDMA (Remote Direct Memory Access) environment, you can directly process data I/O between GPU memories, enabling high‑speed AI/Machine Learning computation.
- Storage Connection: Provides various additional attached storage besides the OS disk.
  * High-performance SSD NAS File Storage, Block Storage, and Object Storage directly integrated with a high-speed network can also be used together.
- Network Configuration Management: The server’s subnet/IP can be easily changed from the values set at initial creation.
  * NAT IP can be enabled or disabled as needed.
- Monitoring: You can view monitoring information for computing resources such as CPU, GPU, Memory, and Disk through Cloud Monitoring.
  * To use the Cloud Monitoring service with a Multi-node GPU Cluster, you need to install the Agent.
  * Please install the Agent to ensure stable service.
  * For more details, please refer to Multi-node GPU Cluster Monitoring Metrics.
- Terraform Provision: Provides an IaC environment via Terraform.
Component
Multi-node GPU Cluster provides GPUs as Bare Metal Servers with standard images and server types. NVSwitch and NVLink are provided.
Specifications by GPU Type
A GPU (Graphics Processing Unit) is specialized for parallel operations that process large amounts of data quickly, enabling large-scale parallel computation in fields such as artificial intelligence (AI) and data analysis.
The following are the specifications of GPU types offered by the Multi-node GPU Cluster service.
| Category | H100 Type | B300 Type |
|---|---|---|
| GPU Architecture | NVIDIA Hopper | NVIDIA Blackwell Ultra |
| GPU Memory | 80 GiB | 268 GiB |
| GPU Transistors | 80 billion 4N TSMC | 208 billion 4NP TSMC |
| FP16 Tensor Core (Dense) | 989 TFLOPS | 2.25 PFLOPS |
| FP8 Tensor Core (Dense) | 1,979 TFLOPS | 4.5 PFLOPS |
| FP4 Tensor Core (Dense) | Not supported | 13.5 PFLOPS |
| GPU Memory Bandwidth | 3,352 GB/s HBM3 | 8 TB/s HBM3e |
| NVLink Generation | NVLink 4 | NVLink 5 |
| NVLink Signaling Rate | 25 GB/s (x18) | 50 GB/s (x18) |
| NVSwitch GPU-to-GPU bandwidth | 900 GB/s | 1.8 TB/s |
| Total NVSwitch aggregate bandwidth | 7.2 TB/s | 14.4 TB/s |
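The NVLink and NVSwitch rows in the table are related by simple arithmetic, which the following sketch reproduces. It assumes that the signaling rate is quoted per link and per direction (so it is doubled to get the bidirectional figure), and that the aggregate figure covers the 8 GPUs of the standard server model. Both are interpretations of the table, not statements from the service documentation, and the function names are illustrative.

```python
# Figures copied from the specification table above.
NVLINK = {
    "H100": {"per_link_gb_s": 25, "links": 18},
    "B300": {"per_link_gb_s": 50, "links": 18},
}

# The standard GPU Bare Metal model is equipped with 8 GPUs.
GPUS_PER_SERVER = 8


def gpu_to_gpu_gb_s(gpu_type: str) -> int:
    """Bidirectional NVLink bandwidth of one GPU, in GB/s.

    Assumption: the per-link rate is per direction, hence the factor of 2.
    """
    spec = NVLINK[gpu_type]
    return spec["per_link_gb_s"] * spec["links"] * 2


def aggregate_tb_s(gpu_type: str) -> float:
    """Sum of the per-GPU bandwidth over all GPUs in one server, in TB/s."""
    return gpu_to_gpu_gb_s(gpu_type) * GPUS_PER_SERVER / 1000


if __name__ == "__main__":
    for name in NVLINK:
        print(name, gpu_to_gpu_gb_s(name), "GB/s per GPU,",
              aggregate_tb_s(name), "TB/s per server")
```

Under these assumptions the results match the table: 900 GB/s and 7.2 TB/s for the H100 type, 1.8 TB/s and 14.4 TB/s for the B300 type.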
OS and GPU driver version
The operating systems (OS) supported by the Multi-node GPU Cluster are as follows.
| OS | OS version | GPU driver version |
|---|---|---|
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |
| Ubuntu | 24.04 | 580.105.08 |
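As a convenience, the table above can be encoded as a lookup to check whether a server's installed driver is one of the listed versions. This is a minimal sketch: the names `SUPPORTED_DRIVERS` and `is_supported` are hypothetical and not part of any Samsung Cloud Platform tooling. On a running server, the installed driver version can typically be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
# OS version -> GPU driver versions, copied from the table above.
SUPPORTED_DRIVERS = {
    "22.04": {"535.86.10", "535.183.06"},
    "24.04": {"580.105.08"},
}


def is_supported(ubuntu_version: str, driver_version: str) -> bool:
    """Return True if the driver is listed for the given Ubuntu version."""
    listed = SUPPORTED_DRIVERS.get(ubuntu_version.strip(), set())
    return driver_version.strip() in listed


if __name__ == "__main__":
    print(is_supported("22.04", "535.183.06"))
    print(is_supported("24.04", "535.86.10"))
```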
Server type
The format of server types provided by the Multi-node GPU Cluster is as follows.
- Example: when the server type is g2c96h8_metal
| Category | Example | Detailed description |
|---|---|---|
| Server generation | g2 | Provided server generation |
| CPU | c96 | Number of cores |
| GPU | h8 | GPU type and quantity |
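The naming format can be decomposed programmatically, for example when validating input to automation scripts. The sketch below is inferred from the single documented example `g2c96h8_metal`. The regular expression, the assumption that the GPU type is encoded as lowercase letters, and the fixed `_metal` suffix are all assumptions, and `parse_server_type` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class ServerType(NamedTuple):
    generation: int   # e.g. 2 from "g2"
    cpu_cores: int    # e.g. 96 from "c96"
    gpu_code: str     # e.g. "h" from "h8"
    gpu_count: int    # e.g. 8 from "h8"


# Assumed layout: g<generation>c<cores><gpu letters><gpu count>_metal
_PATTERN = re.compile(r"g(\d+)c(\d+)([a-z]+)(\d+)_metal")


def parse_server_type(name: str) -> ServerType:
    """Split a server type name into its documented components."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized server type: {name!r}")
    generation, cores, gpu_code, gpu_count = match.groups()
    return ServerType(int(generation), int(cores), gpu_code, int(gpu_count))


if __name__ == "__main__":
    print(parse_server_type("g2c96h8_metal"))
```

For `g2c96h8_metal` this yields generation 2, 96 cores, GPU code `h`, and 8 GPUs. Names that do not follow the assumed layout raise `ValueError` rather than being guessed at.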
Preceding Service
This is a list of services that must be pre-configured before creating the service. Please refer to the guide provided for each service and prepare in advance.
| Service Category | Service | Detailed description |
|---|---|---|
| Networking | VPC | A service that provides an isolated virtual network in a cloud environment |