
Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale, high-performance AI computation. By clustering two or more GPU-equipped bare metal servers, it pools multiple GPUs, and it integrates conveniently with Samsung Cloud Platform's high-performance storage and networking services.

Provided Features

Multi-node GPU Cluster provides the following features.

  • Auto Provisioning and Management: Through the web-based Console, you can easily provision and manage the standard GPU Bare Metal server model with 8 GPU cards, from provisioning through resource and cost management.
  • Network Connection: Two or more Bare Metal Servers can be clustered over high-speed interconnects so that multiple GPUs work together. By configuring GPU Direct RDMA (Remote Direct Memory Access), data moves directly between GPU memories, enabling high-speed AI/machine learning computation.
  • Storage Connection: In addition to the OS disk, various storage options can be attached. High-performance SSD-based NAS File Storage connected over a high-speed network, as well as Block Storage and Object Storage, can be used together.
  • Network Setting Management: The server's subnet/IP, set at creation, can be easily changed. A NAT IP can be attached or released as needed.
  • Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use Cloud Monitoring with Multi-node GPU Cluster, you must install the Agent; please install it for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.

Component

Multi-node GPU Cluster provides GPUs as Bare Metal Servers with standard images and server types, and includes NVSwitch and NVLink.

GPU (H100)

A GPU (Graphics Processing Unit) is specialized for parallel computation that processes large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.

Classification | H100 Type
Product Provisioning Method | Bare Metal
GPU Architecture | NVIDIA Hopper
GPU Memory | 80 GB
GPU Transistors | 80 billion (TSMC 4N)
GPU Tensor Performance (FP16) | 989.4 TFLOPS, 1,978.9 TFLOPS*
GPU Memory Bandwidth | 3,352 GB/s (HBM3)
GPU CUDA Cores | 16,896 cores
GPU Tensor Cores | 528 (4th generation)
NVLink Performance | NVLink 4
Total NVLink Bandwidth | 900 GB/s
NVLink Signaling Rate | 25 Gbps (x18)
NVSwitch Performance | NVSwitch 3
NVSwitch GPU Bandwidth | 900 GB/s
Total NVSwitch Aggregate Bandwidth | 7.2 TB/s
* With Sparsity
Table. GPU Type Specifications
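The aggregate NVSwitch figure follows directly from the per-GPU numbers. As a quick sanity check (assuming the standard 8-GPU server model described above):

```python
# Back-of-the-envelope check: per-GPU NVLink bandwidth multiplied by the
# number of GPUs in one standard 8-GPU server gives the aggregate
# NVSwitch bandwidth listed in the spec table.
GPUS_PER_SERVER = 8          # standard Bare Metal model: 8 GPU cards
NVLINK_BW_GB_S = 900         # total NVLink bandwidth per GPU (GB/s)

aggregate_gb_s = GPUS_PER_SERVER * NVLINK_BW_GB_S
print(f"{aggregate_gb_s / 1000} TB/s")  # 7.2 TB/s, matching the table
```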

OS and GPU Driver Version

The operating systems (OS) supported by Multi-node GPU Cluster are as follows.

OS | OS Version | GPU Driver Version
Ubuntu | 22.04 | 535.86.10, 535.183.06
Table. Multi-node GPU Cluster OS and GPU Driver Versions
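When checking whether an installed driver matches one of the supported versions, note that dotted versions such as 535.183.06 must be compared numerically, not as strings (lexicographically, "535.183.06" sorts before "535.86.10"). A minimal sketch; the helper name is illustrative, not part of the service:

```python
def parse_driver_version(version: str) -> tuple[int, ...]:
    """Split a dotted NVIDIA driver version into integer components."""
    return tuple(int(part) for part in version.split("."))

# The two driver versions supported by Multi-node GPU Cluster on Ubuntu 22.04.
SUPPORTED = {"535.86.10", "535.183.06"}

# Numeric comparison orders the versions correctly; a plain string
# comparison would not ("183" sorts before "86" lexicographically).
assert parse_driver_version("535.183.06") > parse_driver_version("535.86.10")
assert "535.183.06" < "535.86.10"  # string order is misleading

print(parse_driver_version("535.183.06"))  # (535, 183, 6)
```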

Server Type

The server types provided by Multi-node GPU Cluster are as follows. For details, refer to Multi-node GPU Cluster server type.

g2c96h8_metal

Classification | Example | Detailed Description
Server Generation | g2 | Server generation; g means GPU server, 2 means 2nd generation
CPU | c96 | Number of cores; c96 means 96 assigned physical cores
GPU | h8 | GPU type and quantity; h indicates the GPU type (H100), 8 means 8 GPUs
Table. Multi-node GPU Cluster server type format
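The naming format above can be parsed mechanically. A minimal sketch, assuming the g/c/h layout described in the table; the function and field names are hypothetical, not an official API:

```python
import re

# Pattern for the server type format described above, e.g. "g2c96h8_metal":
# g<generation> c<cores> h<gpu count> _metal
SERVER_TYPE_RE = re.compile(
    r"^g(?P<generation>\d+)c(?P<cores>\d+)h(?P<gpus>\d+)_metal$"
)

def parse_server_type(name: str) -> dict:
    """Parse a Multi-node GPU Cluster server type string (hypothetical helper)."""
    m = SERVER_TYPE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized server type: {name!r}")
    return {
        "generation": int(m.group("generation")),  # g2 -> 2nd generation
        "cores": int(m.group("cores")),            # c96 -> 96 physical cores
        "gpu_count": int(m.group("gpus")),         # h8 -> 8 H100 GPUs
    }

print(parse_server_type("g2c96h8_metal"))
# {'generation': 2, 'cores': 96, 'gpu_count': 8}
```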

Preceding Service

The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare in advance.

Service Category | Service | Detailed Description
Networking | VPC | A service that provides an independent virtual network in a cloud environment
Table. Multi-node GPU Cluster Preceding Services