
Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale, high-performance AI computation. By clustering two or more GPU-equipped bare metal servers, it pools multiple GPUs, and it integrates conveniently with Samsung Cloud Platform's high-performance storage and networking services.

Provided Features

Multi-node GPU Cluster provides the following features.

  • Auto Provisioning and Management: Through the web-based Console, you can easily provision and manage the standard GPU Bare Metal server model with 8 GPU cards, from provisioning through resource and cost management.
  • Network Connection: Two or more Bare Metal Servers can be clustered over high-speed interconnects so that multiple GPUs work together. By configuring GPU Direct RDMA (Remote Direct Memory Access), data moves directly between GPU memories, enabling high-speed AI/machine learning computation.
  • Storage Connection: In addition to the OS disk, various storage options can be attached. High-performance SSD-based NAS File Storage connected over a high-speed network, as well as Block Storage and Object Storage, can be used together.
  • Network Setting Management: The server's subnet/IP, set at creation, can be easily changed. A NAT IP can be attached or released as needed.
  • Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use Cloud Monitoring with Multi-node GPU Cluster, you must install the Agent; please install it for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.

Component

Multi-node GPU Cluster provides GPUs as Bare Metal Servers with standard images and server types, and includes NVSwitch and NVLink.

GPU (H100)

A GPU (Graphics Processing Unit) is specialized for parallel computation that processes large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.

Classification | H100 Type
Product Provisioning Method | Bare Metal
GPU Architecture | NVIDIA Hopper
GPU Memory | 80 GB
GPU Transistors | 80 billion (TSMC 4N)
GPU Tensor Performance (FP16) | 989.4 TFLOPS, 1,978.9 TFLOPS*
GPU Memory Bandwidth | 3,352 GB/s (HBM3)
GPU CUDA Cores | 16,896 cores
GPU Tensor Cores | 528 (4th generation)
NVLink Performance | NVLink 4
Total NVLink Bandwidth | 900 GB/s
NVLink Signaling Rate | 25 Gbps (x18)
NVSwitch Performance | NVSwitch 3
NVSwitch GPU Bandwidth | 900 GB/s
Total NVSwitch Aggregate Bandwidth | 7.2 TB/s
* With Sparsity
Table. GPU Type Specifications
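The aggregate NVSwitch figure follows directly from the per-GPU numbers. As a quick sanity check (assuming the standard 8-GPU server model described above):

```python
# Back-of-the-envelope check: per-GPU NVLink bandwidth multiplied by the
# number of GPUs in one standard 8-GPU server gives the aggregate
# NVSwitch bandwidth listed in the spec table.
GPUS_PER_SERVER = 8          # standard Bare Metal model: 8 GPU cards
NVLINK_BW_GB_S = 900         # total NVLink bandwidth per GPU (GB/s)

aggregate_gb_s = GPUS_PER_SERVER * NVLINK_BW_GB_S
print(f"{aggregate_gb_s / 1000} TB/s")  # 7.2 TB/s, matching the table
```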

OS and GPU Driver Version

The operating systems (OS) supported by Multi-node GPU Cluster are as follows.

OS | OS Version | GPU Driver Version
Ubuntu | 22.04 | 535.86.10, 535.183.06
Table. Multi-node GPU Cluster OS and GPU Driver Versions
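When checking whether an installed driver matches one of the supported versions, note that dotted versions such as 535.183.06 must be compared numerically, not as strings (lexicographically, "535.183.06" sorts before "535.86.10"). A minimal sketch; the helper name is illustrative, not part of the service:

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Split a dotted NVIDIA driver version into integer components."""
    return tuple(int(part) for part in version.split("."))

# The two driver versions supported by Multi-node GPU Cluster on Ubuntu 22.04.
SUPPORTED = {"535.86.10", "535.183.06"}

# Numeric comparison orders the versions correctly; a plain string
# comparison would not ("183" sorts before "86" lexicographically).
assert parse_driver_version("535.183.06") > parse_driver_version("535.86.10")
assert "535.183.06" < "535.86.10"  # string order is misleading

print(parse_driver_version("535.183.06"))  # (535, 183, 6)
```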

Server Type

The server types provided by Multi-node GPU Cluster are as follows. For details, refer to Multi-node GPU Cluster server type.

g2c96h8_metal

Classification | Example | Detailed Description
Server Generation | g2 | Server generation; g means GPU server, 2 means 2nd generation
CPU | c96 | Number of cores; c96 means 96 assigned physical cores
GPU | h8 | GPU type and quantity; h indicates the GPU type (H100), 8 means 8 GPUs
Table. Multi-node GPU Cluster server type format
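The naming format above can be parsed mechanically. A minimal sketch, assuming the g/c/h layout described in the table; the function and field names are hypothetical, not an official API:

```python
import re

# Pattern for the server type format described above, e.g. "g2c96h8_metal":
# g<generation> c<cores> h<gpu count> _metal
SERVER_TYPE_RE = re.compile(
    r"^g(?P<generation>\d+)c(?P<cores>\d+)h(?P<gpus>\d+)_metal$"
)

def parse_server_type(name: str) -> dict:
    """Parse a Multi-node GPU Cluster server type string (hypothetical helper)."""
    m = SERVER_TYPE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized server type: {name!r}")
    return {
        "generation": int(m.group("generation")),  # g2 -> 2nd generation
        "cores": int(m.group("cores")),            # c96 -> 96 physical cores
        "gpu_count": int(m.group("gpus")),         # h8 -> 8 H100 GPUs
    }

print(parse_server_type("g2c96h8_metal"))
# {'generation': 2, 'cores': 96, 'gpu_count': 8}
```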

Preceding Service

The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare in advance.

Service Category | Service | Detailed Description
Networking | VPC | A service that provides an independent virtual network in a cloud environment
Table. Multi-node GPU Cluster Preceding Services