Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale, high-performance AI computation. It clusters two or more bare metal servers equipped with GPUs into a single multi-GPU cluster, and can be used conveniently with Samsung Cloud Platform's high-performance storage and networking services.

Provided Features

Multi-node GPU Cluster provides the following functions.

  • Auto Provisioning and Management: Through the web-based Console, you can easily handle everything from provisioning to resource and cost management for the standard GPU Bare Metal server model equipped with 8 GPU cards.
  • Network Connection: Two or more Bare Metal Servers can be clustered over high-speed interconnects to use multiple GPUs together, and configuring a GPU Direct RDMA (Remote Direct Memory Access) environment enables direct data I/O between GPU memories for high-speed AI/machine learning computation.
  • Storage Connection: Provides various attachable storage in addition to the OS disk. High-performance SSD NAS File Storage connected over a high-speed network, as well as directly connected Block Storage and Object Storage, can be used in conjunction.
  • Network Setting Management: The server's subnet/IP, initially set at creation, can be changed easily. NAT IP can be attached or detached as needed.
  • Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use the Cloud Monitoring service with Multi-node GPU Cluster, you need to install the Agent; please install the Agent for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.
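To make the clustering model concrete, the sketch below shows how the processes of a distributed job spanning two 8-GPU nodes might be laid out, one process per GPU. The master address and port are hypothetical placeholders, and the helper name is illustrative — in practice these values come from the cluster's subnet configuration.

```python
# Sketch: rendezvous settings for a 2-node, 8-GPU-per-node distributed job.
# The master address/port below are hypothetical placeholders.

def rendezvous_env(node_index: int, local_rank: int,
                   nodes: int = 2, gpus_per_node: int = 8,
                   master_addr: str = "192.168.0.10",
                   master_port: int = 29500) -> dict:
    """Environment variables one worker process would need for NCCL init."""
    return {
        "MASTER_ADDR": master_addr,                # node 0's reachable IP
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(nodes * gpus_per_node),  # total processes = total GPUs
        "RANK": str(node_index * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),             # GPU index on this node
    }

# One process per GPU across the whole cluster:
envs = [rendezvous_env(n, g) for n in range(2) for g in range(8)]
```

With these variables exported, a framework such as PyTorch could call `torch.distributed.init_process_group(backend="nccl")` in each process; the NCCL backend can then use the GPU Direct RDMA fabric for inter-node GPU communication.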

Component

Multi-node GPU Cluster provides GPUs in the form of Bare Metal Servers with standard images and server types, and NVSwitch and NVLink are included.

GPU (H100)

A GPU (Graphics Processing Unit) is specialized for parallel computation that can process large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.

| Classification | H100 Type |
| --- | --- |
| Product Provisioning Method | Bare Metal |
| GPU Architecture | NVIDIA Hopper |
| GPU Memory | 80 GB |
| GPU Transistors | 80 billion (TSMC 4N) |
| GPU Tensor Performance (based on FP16) | 989.4 TFLOPS, 1,978.9 TFLOPS* |
| GPU Memory Bandwidth | 3,352 GB/s (HBM3) |
| GPU CUDA Cores | 16,896 cores |
| GPU Tensor Cores | 528 (4th generation) |
| NVLink Performance | NVLink 4 |
| Total NVLink Bandwidth | 900 GB/s |
| NVLink Signaling Rate | 25 Gbps (x18) |
| NVSwitch Performance | NVSwitch 3 |
| NVSwitch GPU Bandwidth | 900 GB/s |
| Total NVSwitch Aggregate Bandwidth | 7.2 TB/s |

  • * With Sparsity
Table. GPU Type Specifications
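The aggregate figures in the table follow from the per-GPU numbers and the 8-GPU server configuration described above; a quick arithmetic cross-check:

```python
# Cross-checking the spec table's aggregate figures (illustration only).
per_gpu_nvswitch_bw_gb_s = 900   # NVSwitch GPU bandwidth per GPU (GB/s)
gpus_per_server = 8              # standard model: 8 GPU cards
per_gpu_memory_gb = 80           # H100 memory per GPU (GB)

aggregate_bw_tb_s = per_gpu_nvswitch_bw_gb_s * gpus_per_server / 1000
total_gpu_memory_gb = per_gpu_memory_gb * gpus_per_server

assert aggregate_bw_tb_s == 7.2     # "Total NVSwitch Aggregate Bandwidth"
assert total_gpu_memory_gb == 640   # the server type's total GPU memory
```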

OS and GPU Driver Version

The operating systems (OS) supported by Multi-node GPU Cluster are as follows.

| OS | OS Version | GPU Driver Version |
| --- | --- | --- |
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |

Table. Multi-node GPU Cluster OS and GPU Driver Version

Server Type

The server types provided by Multi-node GPU Cluster are as follows. For a detailed description of the server types provided by Multi-node GPU Cluster, please refer to Multi-node GPU Cluster server type.

g2c96h8_metal
| Classification | Example | Detailed Description |
| --- | --- | --- |
| Server Generation | g2 | Provided server generation. g2: "g" means GPU server and "2" means the generation |
| CPU | c96 | Number of cores. c96: the assigned cores are physical cores |
| GPU | h8 | GPU type and quantity. h8: "h" means the GPU type and "8" means the GPU quantity |

Table. Multi-node GPU Cluster server type format

Preceding Service

The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare in advance.

| Service Category | Service | Detailed Description |
| --- | --- | --- |
| Networking | VPC | A service that provides an independent virtual network in a cloud environment |

Table. Multi-node GPU Cluster preceding service

1 - Server Type

Multi-node GPU Cluster Server Type

Multi-node GPU Cluster is divided based on the provided GPU Type, and the GPU used in the Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Please select the server type according to the specifications of the application you want to run in the Multi-node GPU Cluster.

The server types supported by Multi-node GPU Cluster are in the following format:

g2c96h8_metal
| Classification | Example | Detailed Description |
| --- | --- | --- |
| Server Generation | g2 | Provided server generation. g2: "g" means GPU server specification and "2" means the generation |
| CPU | c96 | Number of cores. c96: the assigned cores are physical cores |
| GPU | h8 | GPU type and quantity. h8: "h" means the GPU type and "8" means the GPU quantity |

Table. Multi-node GPU Cluster server type format
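The naming scheme above is regular enough to decode mechanically. The helper below is a hypothetical illustration of that format, not part of any SDK; the field names are our own.

```python
import re

# Hypothetical helper that decodes a server type name of the format above.
SERVER_TYPE_RE = re.compile(
    r"^g(?P<generation>\d+)c(?P<cores>\d+)h(?P<gpus>\d+)_(?P<provisioning>\w+)$"
)

def parse_server_type(name: str) -> dict:
    m = SERVER_TYPE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized server type: {name}")
    return {
        "generation": int(m.group("generation")),  # GPU server generation
        "cores": int(m.group("cores")),            # physical CPU cores
        "gpus": int(m.group("gpus")),              # number of GPUs
        "provisioning": m.group("provisioning"),   # e.g. "metal" = bare metal
    }

print(parse_server_type("g2c96h8_metal"))
# → {'generation': 2, 'cores': 96, 'gpus': 8, 'provisioning': 'metal'}
```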

g2 Server Type

The g2 server type is a GPU Bare Metal Server using NVIDIA H100 Tensor Core GPU, suitable for large-scale high-performance AI computing.

  • Provides up to 8 NVIDIA H100 Tensor Core GPUs
  • Each GPU has 16,896 CUDA cores and 528 Tensor cores
  • Supports up to 96 vCPUs and 1,920 GB of memory
  • Supports up to 100 Gbps networking speed
  • 900 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
| Server Type | GPU | GPU Memory | CPU (Core) | Memory | Disk | GPU P2P |
| --- | --- | --- | --- | --- | --- | --- |
| g2c96h8_metal | H100 | 640 GB | 96 vCore | 2 TB | SSD (OS) 960 GB x 2, NVMe SSD 3.84 TB x 4 | 900 GB/s NVSwitch |

Table. Multi-node GPU Cluster server type specification > H100 server type

2 - Monitoring Metrics

Multi-node GPU Cluster monitoring metrics

The following table shows the monitoring metrics of Multi-node GPU Cluster that can be checked through Cloud Monitoring.

Guide
To view monitoring metrics for Multi-node GPU Cluster, you must install the Agent as described in the guide. Please install the Agent for stable service use. For the Agent installation method and detailed Cloud Monitoring usage, refer to the Cloud Monitoring guide.

Multi-node GPU Cluster [Cluster]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Total memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Bytes swapped in | bytes |
| Memory Swap Out [Basic] | Bytes swapped out | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Cluster] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Cluster GPU Count | Sum of the GPU counts in the cluster: the GPU counts of the nodes in the same GPU cluster are summed | cnt |
| Cluster GPU Count In Use | Number of GPUs in the cluster running jobs: the number of GPUs held by processes is summed by parsing the "Processes:" section at the bottom of the nvidia-smi output on each node in the same GPU cluster | cnt |
| Cluster GPU Usage | Average GPU utilization in the cluster: the GPU utilization values of the nodes in the same GPU cluster are averaged | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization in the cluster: the GPU memory utilization values of the nodes in the same GPU cluster are averaged | % |

Table. Multi-node GPU Cluster [Cluster] additional monitoring metrics (Agent installation required)
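The cluster-level metrics above are simple sums and averages over per-node readings. The sketch below illustrates that aggregation; the field names are illustrative, not the Agent's actual schema.

```python
# Sketch of how the four cluster-level metrics could be derived from
# per-node readings (field names here are illustrative, not the Agent's).

def cluster_metrics(nodes: list) -> dict:
    n = len(nodes)
    return {
        "Cluster GPU Count": sum(d["gpu_count"] for d in nodes),         # SUM
        "Cluster GPU Count In Use": sum(d["gpus_in_use"] for d in nodes),# SUM
        "Cluster GPU Usage": sum(d["gpu_util_pct"] for d in nodes) / n,  # AVG
        "Cluster GPU Memory Usage [Avg]":
            sum(d["gpu_mem_pct"] for d in nodes) / n,                    # AVG
    }

# Two 8-GPU nodes in the same GPU cluster:
nodes = [
    {"gpu_count": 8, "gpus_in_use": 8, "gpu_util_pct": 90.0, "gpu_mem_pct": 70.0},
    {"gpu_count": 8, "gpus_in_use": 4, "gpu_util_pct": 50.0, "gpu_mem_pct": 30.0},
]
print(cluster_metrics(nodes))
# → {'Cluster GPU Count': 16, 'Cluster GPU Count In Use': 12,
#    'Cluster GPU Usage': 70.0, 'Cluster GPU Memory Usage [Avg]': 50.0}
```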

Multi-node GPU Cluster [Node]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Total memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Bytes swapped in | bytes |
| Memory Swap Out [Basic] | Bytes swapped out | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Node] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| GPU Count | GPU count | cnt |
| GPU Temperature | GPU temperature | - |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU usage rate | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current power usage of the GPU | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs on the node running jobs | cnt |
| Execution Status for nvidia-smi | nvidia-smi command execution result | status |
| Core Usage [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| Core Usage [System] | Ratio of CPU time spent in kernel space | % |
| Core Usage [User] | Ratio of CPU time spent in user space | % |
| CPU Cores | Number of CPU cores on the host; unnormalized usage ratios can reach (number of cores) x 100% | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (400% if all 4 cores are fully used) | % |
| CPU Usage [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Percentage of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (400% if all 4 cores are fully used) | % |
| CPU Usage [User] | Percentage of CPU time used in user space (400% if all 4 cores are fully used) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| CPU Usage/Core [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| Disk CPU Usage [IO Request] | Ratio of CPU time spent executing I/O requests for the device (device bandwidth utilization); a value close to 100% means the device is saturated | % |
| Disk Queue Size [Avg] | Average queue length of requests issued to the device | num |
| Disk Read Bytes | Bytes read from the device per second | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of system.diskio.read.bytes for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes read successfully; on Linux a sector size of 512 is assumed, and the number of sectors read is multiplied by 512 | bytes |
| Disk Read Requests | Number of read requests to the disk device per second | cnt |
| Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of successfully completed reads | cnt |
| Disk Request Size [Avg] | Average size of requests issued to the device (in sectors) | num |
| Disk Service Time [Avg] | Average service time (in milliseconds) for I/O requests issued to the device | ms |
| Disk Wait Time [Avg] | Average time spent by requests issued to the device | ms |
| Disk Wait Time [Read] | Average disk wait time for reads | ms |
| Disk Wait Time [Write] | Average disk wait time for writes | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of system.diskio.write.bytes for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes written successfully; on Linux a sector size of 512 is assumed, and the number of sectors written is multiplied by 512 | bytes |
| Disk Write Requests | Number of write requests to the disk device per second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of successfully completed writes | cnt |
| Disk Writes Bytes | Bytes written to the device per second | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | Total number of file nodes in the filesystem | cnt |
| Filesystem Nodes [Free] | Total number of free file nodes in the filesystem | cnt |
| Filesystem Size [Available] | Disk space (in bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Free disk space (bytes) | bytes |
| Filesystem Size [Total] | Total disk space (bytes) | bytes |
| Filesystem Usage | Percentage of disk space used | % |
| Filesystem Usage [Avg] | Average of filesystem.used.pct across individual filesystems | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Min] | Minimum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used disk space (bytes) | bytes |
| Filesystem Used [Inode] | Inode usage | bytes |
| Memory Free | Total amount of free memory (bytes); does not include memory used by system caches and buffers (see system.memory.actual.free) | bytes |
| Memory Free [Actual] | Actual available memory (bytes). The calculation depends on the OS: on Linux it is MemAvailable from /proc/meminfo, or is derived from free memory, cache, and buffers when MemAvailable is not present; on OSX it is the sum of free and inactive memory; on Windows it is the same as system.memory.free | bytes |
| Memory Free [Swap] | Free swap memory | bytes |
| Memory Total | Total memory | bytes |
| Memory Total [Swap] | Total swap memory | bytes |
| Memory Usage | Percentage of memory used: ((Memory Total - Memory Free) / Memory Total) * 100, where Memory Free is the currently available free memory | % |
| Memory Usage [Actual] | Percentage of memory actually used: ((Memory Total - Memory Available) / Memory Total) * 100, or ((Memory Total - (Memory Free + Buffers + Cached)) / Memory Total) * 100, where Buffers is the memory used by buffers and Cached is the memory used by the page cache | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Percentage of swap memory used | % |
| Memory Used | Used memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes): total memory minus actual available memory, which is calculated differently depending on the OS (see system.actual.free) | bytes |
| Memory Used [Swap] | Used swap memory | bytes |
| Collisions | Network collisions | cnt |
| Network In Bytes | Number of bytes received | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta] | Delta of the number of bytes received | bytes |
| Network In Dropped | Number of incoming packets dropped | cnt |
| Network In Errors | Number of receive errors | cnt |
| Network In Packets | Number of packets received | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta] | Delta of the number of packets received | cnt |
| Network Out Bytes | Number of bytes transmitted | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta] | Delta of the number of bytes transmitted | bytes |
| Network Out Dropped | Number of outgoing packets dropped; the operating system does not report this value on Darwin and BSD, so it is always 0 there | cnt |
| Network Out Errors | Number of transmit errors | cnt |
| Network Out Packets | Number of packets transmitted | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta] | Delta of the number of packets transmitted | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Percentage of available connection ports in use | % |
| SYN Sent Sockets | Number of sockets in the SYN_SENT state (connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update; similar to the %CPU value shown for the process by the top command on Unix systems | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event, normalized by the number of cores (0~100%) | % |
| Process Memory Usage | Share of main memory (RAM) occupied by the process | % |
| Process Memory Used | Resident set size: the amount of memory the process occupies in RAM; on Windows, the current working set size | bytes |
| Process PID | Process PID | PID |
| Process PPID | Parent process PID | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state is unknown or cannot be retrieved | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of threads in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | Number of context switches (per second) | cnt |
| Load/Core [1 min] | 1-minute load average divided by the number of cores | cnt |
| Load/Core [15 min] | 15-minute load average divided by the number of cores | cnt |
| Load/Core [5 min] | 5-minute load average divided by the number of cores | cnt |
| Multipaths [Active] | Number of external storage connection paths in the active state | cnt |
| Multipaths [Failed] | Number of external storage connection paths in the failed state | cnt |
| Multipaths [Faulty] | Number of external storage connection paths in the faulty state | cnt |
| NTP Offset | Measured offset of the last sample (time difference between the NTP server and the local environment) | num |
| Run Queue Length | Length of the run queue (processes waiting to execute) | num |
| Uptime | OS uptime (milliseconds) | ms |
| Context Switches | Number of CPU context switches (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from the Windows logical disk per second | cnt |
| Disk Read Time [Avg] | Average data read time (sec) | sec |
| Disk Transfer Time [Avg] | Average disk wait time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Number of bytes written to the Windows logical disk per second | cnt |
| Disk Write Time [Avg] | Average data write time (sec) | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Non-paged pool usage of kernel memory | bytes |
| Pool Used [Paged] | Paged pool usage of kernel memory | bytes |
| Process [Running] | Number of processes currently running | cnt |
| Threads [Running] | Number of threads currently running | cnt |
| Threads [Waiting] | Number of threads waiting for processor time | cnt |

Table. Multi-node GPU Cluster [Node] additional monitoring metrics (Agent installation required)
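The memory and per-core CPU definitions in the table can be written out directly. The sketch below uses made-up sample readings to show how the formulas behave; the function names are our own.

```python
# The memory and CPU usage definitions from the table, written out
# (values below are made-up sample readings).

def memory_usage_pct(total: int, free: int) -> float:
    # Memory Usage: ((Memory Total - Memory Free) / Memory Total) * 100
    return (total - free) / total * 100

def memory_usage_actual_pct(total: int, free: int,
                            buffers: int, cached: int) -> float:
    # Memory Usage [Actual]:
    # ((Total - (Free + Buffers + Cached)) / Total) * 100
    return (total - (free + buffers + cached)) / total * 100

def cpu_usage_per_core(unnormalized_pct: float, cores: int) -> float:
    # CPU Usage/Core: the unnormalized rate divided by the number of cores,
    # so 4 fully busy cores read 400% unnormalized but 100% per core
    return unnormalized_pct / cores

total = 16 * 1024**3
free, buffers, cached = 4 * 1024**3, 1 * 1024**3, 3 * 1024**3
print(memory_usage_pct(total, free))                          # → 75.0
print(memory_usage_actual_pct(total, free, buffers, cached))  # → 50.0
print(cpu_usage_per_core(400.0, 4))                           # → 100.0
```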