Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale, high-performance AI computation. It clusters two or more bare metal servers equipped with GPUs into a single multi-GPU cluster, and can be used conveniently with Samsung Cloud Platform's high-performance storage and networking services.

Provided Features

Multi-node GPU Cluster provides the following functions.

  • Auto Provisioning and Management: Through the web-based Console, you can easily handle everything from provisioning to resource and cost management for the standard GPU Bare Metal server model equipped with 8 GPU cards.
  • Network Connection: Two or more Bare Metal Servers can be clustered over high-speed interconnects to use multiple GPUs together, and configuring a GPU Direct RDMA (Remote Direct Memory Access) environment enables direct data I/O between GPU memories for high-speed AI/machine learning computation.
  • Storage Connection: Provides various attachable storage in addition to the OS disk. High-performance SSD NAS File Storage connected over a high-speed network, as well as directly connected Block Storage and Object Storage, can be used in conjunction.
  • Network Setting Management: The server's subnet/IP, initially set at creation, can be changed easily. NAT IP can be attached or detached as needed.
  • Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use the Cloud Monitoring service with Multi-node GPU Cluster, you need to install the Agent; please install the Agent for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.
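To make the clustering model concrete, the sketch below shows how the processes of a distributed job spanning two 8-GPU nodes might be laid out, one process per GPU. The master address and port are hypothetical placeholders, and the helper name is illustrative — in practice these values come from the cluster's subnet configuration.

```python
# Sketch: rendezvous settings for a 2-node, 8-GPU-per-node distributed job.
# The master address/port below are hypothetical placeholders.

def rendezvous_env(node_index: int, local_rank: int,
                   nodes: int = 2, gpus_per_node: int = 8,
                   master_addr: str = "192.168.0.10",
                   master_port: int = 29500) -> dict:
    """Environment variables one worker process would need for NCCL init."""
    return {
        "MASTER_ADDR": master_addr,                # node 0's reachable IP
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(nodes * gpus_per_node),  # total processes = total GPUs
        "RANK": str(node_index * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),             # GPU index on this node
    }

# One process per GPU across the whole cluster:
envs = [rendezvous_env(n, g) for n in range(2) for g in range(8)]
```

With these variables exported, a framework such as PyTorch could call `torch.distributed.init_process_group(backend="nccl")` in each process; the NCCL backend can then use the GPU Direct RDMA fabric for inter-node GPU communication.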

Component

Multi-node GPU Cluster provides GPUs in the form of Bare Metal Servers with standard images and server types, and NVSwitch and NVLink are included.

GPU (H100)

A GPU (Graphics Processing Unit) is specialized for parallel computation that can process large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.

| Classification | H100 Type |
| --- | --- |
| Product Provisioning Method | Bare Metal |
| GPU Architecture | NVIDIA Hopper |
| GPU Memory | 80 GB |
| GPU Transistors | 80 billion (TSMC 4N) |
| GPU Tensor Performance (based on FP16) | 989.4 TFLOPS, 1,978.9 TFLOPS* |
| GPU Memory Bandwidth | 3,352 GB/s (HBM3) |
| GPU CUDA Cores | 16,896 cores |
| GPU Tensor Cores | 528 (4th generation) |
| NVLink Performance | NVLink 4 |
| Total NVLink Bandwidth | 900 GB/s |
| NVLink Signaling Rate | 25 Gbps (x18) |
| NVSwitch Performance | NVSwitch 3 |
| NVSwitch GPU Bandwidth | 900 GB/s |
| Total NVSwitch Aggregate Bandwidth | 7.2 TB/s |

  • * With Sparsity
Table. GPU Type Specifications
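The aggregate figures in the table follow from the per-GPU numbers and the 8-GPU server configuration described above; a quick arithmetic cross-check:

```python
# Cross-checking the spec table's aggregate figures (illustration only).
per_gpu_nvswitch_bw_gb_s = 900   # NVSwitch GPU bandwidth per GPU (GB/s)
gpus_per_server = 8              # standard model: 8 GPU cards
per_gpu_memory_gb = 80           # H100 memory per GPU (GB)

aggregate_bw_tb_s = per_gpu_nvswitch_bw_gb_s * gpus_per_server / 1000
total_gpu_memory_gb = per_gpu_memory_gb * gpus_per_server

assert aggregate_bw_tb_s == 7.2     # "Total NVSwitch Aggregate Bandwidth"
assert total_gpu_memory_gb == 640   # the server type's total GPU memory
```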

OS and GPU Driver Version

The operating systems (OS) supported by Multi-node GPU Cluster are as follows.

| OS | OS Version | GPU Driver Version |
| --- | --- | --- |
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |

Table. Multi-node GPU Cluster OS and GPU Driver Version

Server Type

The server types provided by Multi-node GPU Cluster are as follows. For a detailed description of the server types provided by Multi-node GPU Cluster, please refer to Multi-node GPU Cluster server type.

g2c96h8_metal
| Classification | Example | Detailed Description |
| --- | --- | --- |
| Server Generation | g2 | Provided server generation. g2: "g" means GPU server and "2" means the generation |
| CPU | c96 | Number of cores. c96: the assigned cores are physical cores |
| GPU | h8 | GPU type and quantity. h8: "h" means the GPU type and "8" means the GPU quantity |

Table. Multi-node GPU Cluster server type format

Preceding Service

The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare in advance.

| Service Category | Service | Detailed Description |
| --- | --- | --- |
| Networking | VPC | A service that provides an independent virtual network in a cloud environment |

Table. Multi-node GPU Cluster preceding service

1 - Server Type

Multi-node GPU Cluster Server Type

Multi-node GPU Cluster is divided based on the provided GPU Type, and the GPU used in the Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Please select the server type according to the specifications of the application you want to run in the Multi-node GPU Cluster.

The server types supported by Multi-node GPU Cluster are in the following format:

g2c96h8_metal
| Classification | Example | Detailed Description |
| --- | --- | --- |
| Server Generation | g2 | Provided server generation. g2: "g" means GPU server specification and "2" means the generation |
| CPU | c96 | Number of cores. c96: the assigned cores are physical cores |
| GPU | h8 | GPU type and quantity. h8: "h" means the GPU type and "8" means the GPU quantity |

Table. Multi-node GPU Cluster server type format
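The naming scheme above is regular enough to decode mechanically. The helper below is a hypothetical illustration of that format, not part of any SDK; the field names are our own.

```python
import re

# Hypothetical helper that decodes a server type name of the format above.
SERVER_TYPE_RE = re.compile(
    r"^g(?P<generation>\d+)c(?P<cores>\d+)h(?P<gpus>\d+)_(?P<provisioning>\w+)$"
)

def parse_server_type(name: str) -> dict:
    m = SERVER_TYPE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized server type: {name}")
    return {
        "generation": int(m.group("generation")),  # GPU server generation
        "cores": int(m.group("cores")),            # physical CPU cores
        "gpus": int(m.group("gpus")),              # number of GPUs
        "provisioning": m.group("provisioning"),   # e.g. "metal" = bare metal
    }

print(parse_server_type("g2c96h8_metal"))
# → {'generation': 2, 'cores': 96, 'gpus': 8, 'provisioning': 'metal'}
```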

g2 Server Type

The g2 server type is a GPU Bare Metal Server using NVIDIA H100 Tensor Core GPU, suitable for large-scale high-performance AI computing.

  • Provides up to 8 NVIDIA H100 Tensor Core GPUs
  • Each GPU has 16,896 CUDA cores and 528 Tensor cores
  • Supports up to 96 vCPUs and 1,920 GB of memory
  • Supports up to 100 Gbps networking speed
  • 900 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
| Server Type | GPU | GPU Memory | CPU (Core) | Memory | Disk | GPU P2P |
| --- | --- | --- | --- | --- | --- | --- |
| g2c96h8_metal | H100 | 640 GB | 96 vCore | 2 TB | SSD (OS) 960 GB x 2, NVMe SSD 3.84 TB x 4 | 900 GB/s NVSwitch |

Table. Multi-node GPU Cluster server type specification > H100 server type

2 - Monitoring Metrics

Multi-node GPU Cluster monitoring metrics

The following table shows the monitoring metrics of Multi-node GPU Cluster that can be checked through Cloud Monitoring.

Guide
To view monitoring metrics for Multi-node GPU Cluster, you must install the Agent as described in the guide. Please install the Agent for stable service use. For the Agent installation method and detailed Cloud Monitoring usage, refer to the Cloud Monitoring guide.

Multi-node GPU Cluster [Cluster]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Total memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Bytes swapped in | bytes |
| Memory Swap Out [Basic] | Bytes swapped out | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Cluster] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Cluster GPU Count | Sum of the GPU counts in the cluster: the GPU counts of the nodes in the same GPU cluster are summed | cnt |
| Cluster GPU Count In Use | Number of GPUs in the cluster running jobs: the number of GPUs held by processes is summed by parsing the "Processes:" section at the bottom of the nvidia-smi output on each node in the same GPU cluster | cnt |
| Cluster GPU Usage | Average GPU utilization in the cluster: the GPU utilization values of the nodes in the same GPU cluster are averaged | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization in the cluster: the GPU memory utilization values of the nodes in the same GPU cluster are averaged | % |

Table. Multi-node GPU Cluster [Cluster] additional monitoring metrics (Agent installation required)
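The cluster-level metrics above are simple sums and averages over per-node readings. The sketch below illustrates that aggregation; the field names are illustrative, not the Agent's actual schema.

```python
# Sketch of how the four cluster-level metrics could be derived from
# per-node readings (field names here are illustrative, not the Agent's).

def cluster_metrics(nodes: list) -> dict:
    n = len(nodes)
    return {
        "Cluster GPU Count": sum(d["gpu_count"] for d in nodes),         # SUM
        "Cluster GPU Count In Use": sum(d["gpus_in_use"] for d in nodes),# SUM
        "Cluster GPU Usage": sum(d["gpu_util_pct"] for d in nodes) / n,  # AVG
        "Cluster GPU Memory Usage [Avg]":
            sum(d["gpu_mem_pct"] for d in nodes) / n,                    # AVG
    }

# Two 8-GPU nodes in the same GPU cluster:
nodes = [
    {"gpu_count": 8, "gpus_in_use": 8, "gpu_util_pct": 90.0, "gpu_mem_pct": 70.0},
    {"gpu_count": 8, "gpus_in_use": 4, "gpu_util_pct": 50.0, "gpu_mem_pct": 30.0},
]
print(cluster_metrics(nodes))
# → {'Cluster GPU Count': 16, 'Cluster GPU Count In Use': 12,
#    'Cluster GPU Usage': 70.0, 'Cluster GPU Memory Usage [Avg]': 50.0}
```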

Multi-node GPU Cluster [Node]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Total memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Bytes swapped in | bytes |
| Memory Swap Out [Basic] | Bytes swapped out | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Node] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| GPU Count | GPU count | cnt |
| GPU Temperature | GPU temperature | - |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU usage rate | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current power usage of the GPU | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs on the node running jobs | cnt |
| Execution Status for nvidia-smi | nvidia-smi command execution result | status |
| Core Usage [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| Core Usage [System] | Ratio of CPU time spent in kernel space | % |
| Core Usage [User] | Ratio of CPU time spent in user space | % |
| CPU Cores | Number of CPU cores on the host; unnormalized usage ratios can reach (number of cores) x 100% | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (400% if all 4 cores are fully used) | % |
| CPU Usage [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Percentage of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (400% if all 4 cores are fully used) | % |
| CPU Usage [User] | Percentage of CPU time used in user space (400% if all 4 cores are fully used) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| CPU Usage/Core [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space (normalized by the number of cores; 100% if all 4 cores are fully used) | % |
| Disk CPU Usage [IO Request] | Ratio of CPU time spent executing I/O requests for the device (device bandwidth utilization); a value close to 100% means the device is saturated | % |
| Disk Queue Size [Avg] | Average queue length of requests issued to the device | num |
| Disk Read Bytes | Bytes read from the device per second | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of system.diskio.read.bytes for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes read successfully; on Linux a sector size of 512 is assumed, and the number of sectors read is multiplied by 512 | bytes |
| Disk Read Requests | Number of read requests to the disk device per second | cnt |
| Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of successfully completed reads | cnt |
| Disk Request Size [Avg] | Average size of requests issued to the device (in sectors) | num |
| Disk Service Time [Avg] | Average service time (in milliseconds) for I/O requests issued to the device | ms |
| Disk Wait Time [Avg] | Average time spent by requests issued to the device | ms |
| Disk Wait Time [Read] | Average disk wait time for reads | ms |
| Disk Wait Time [Write] | Average disk wait time for writes | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of system.diskio.write.bytes for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes written successfully; on Linux a sector size of 512 is assumed, and the number of sectors written is multiplied by 512 | bytes |
| Disk Write Requests | Number of write requests to the disk device per second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of successfully completed writes | cnt |
| Disk Writes Bytes | Bytes written to the device per second | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | Total number of file nodes in the filesystem | cnt |
| Filesystem Nodes [Free] | Total number of free file nodes in the filesystem | cnt |
| Filesystem Size [Available] | Disk space (in bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Free disk space (bytes) | bytes |
| Filesystem Size [Total] | Total disk space (bytes) | bytes |
| Filesystem Usage | Percentage of disk space used | % |
| Filesystem Usage [Avg] | Average of filesystem.used.pct across individual filesystems | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Min] | Minimum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used disk space (bytes) | bytes |
| Filesystem Used [Inode] | Inode usage | bytes |
| Memory Free | Total amount of free memory (bytes); does not include memory used by system caches and buffers (see system.memory.actual.free) | bytes |
| Memory Free [Actual] | Actual available memory (bytes). The calculation depends on the OS: on Linux it is MemAvailable from /proc/meminfo, or is derived from free memory, cache, and buffers when MemAvailable is not present; on OSX it is the sum of free and inactive memory; on Windows it is the same as system.memory.free | bytes |
| Memory Free [Swap] | Free swap memory | bytes |
| Memory Total | Total memory | bytes |
| Memory Total [Swap] | Total swap memory | bytes |
| Memory Usage | Percentage of memory used: ((Memory Total - Memory Free) / Memory Total) * 100, where Memory Free is the currently available free memory | % |
| Memory Usage [Actual] | Percentage of memory actually used: ((Memory Total - Memory Available) / Memory Total) * 100, or ((Memory Total - (Memory Free + Buffers + Cached)) / Memory Total) * 100, where Buffers is the memory used by buffers and Cached is the memory used by the page cache | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Percentage of swap memory used | % |
| Memory Used | Used memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes): total memory minus actual available memory, which is calculated differently depending on the OS (see system.actual.free) | bytes |
| Memory Used [Swap] | Used swap memory | bytes |
| Collisions | Network collisions | cnt |
| Network In Bytes | Number of bytes received | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta] | Delta of the number of bytes received | bytes |
| Network In Dropped | Number of incoming packets dropped | cnt |
| Network In Errors | Number of receive errors | cnt |
| Network In Packets | Number of packets received | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta] | Delta of the number of packets received | cnt |
| Network Out Bytes | Number of bytes transmitted | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta] | Delta of the number of bytes transmitted | bytes |
| Network Out Dropped | Number of outgoing packets dropped; the operating system does not report this value on Darwin and BSD, so it is always 0 there | cnt |
| Network Out Errors | Number of transmit errors | cnt |
| Network Out Packets | Number of packets transmitted | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta] | Delta of the number of packets transmitted | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Percentage of available connection ports in use | % |
| SYN Sent Sockets | Number of sockets in the SYN_SENT state (connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update; similar to the %CPU value shown for the process by the top command on Unix systems | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event, normalized by the number of cores (0~100%) | % |
| Process Memory Usage | Share of main memory (RAM) occupied by the process | % |
| Process Memory Used | Resident set size: the amount of memory the process occupies in RAM; on Windows, the current working set size | bytes |
| Process PID | Process PID | PID |
| Process PPID | Parent process PID | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state is unknown or cannot be retrieved | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of threads in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | Number of context switches (per second) | cnt |
| Load/Core [1 min] | 1-minute load average divided by the number of cores | cnt |
| Load/Core [15 min] | 15-minute load average divided by the number of cores | cnt |
| Load/Core [5 min] | 5-minute load average divided by the number of cores | cnt |
| Multipaths [Active] | Number of external storage connection paths in the active state | cnt |
| Multipaths [Failed] | Number of external storage connection paths in the failed state | cnt |
| Multipaths [Faulty] | Number of external storage connection paths in the faulty state | cnt |
| NTP Offset | Measured offset of the last sample (time difference between the NTP server and the local environment) | num |
| Run Queue Length | Length of the run queue (processes waiting to execute) | num |
| Uptime | OS uptime (milliseconds) | ms |
| Context Switches | Number of CPU context switches (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from the Windows logical disk per second | cnt |
| Disk Read Time [Avg] | Average data read time (sec) | sec |
| Disk Transfer Time [Avg] | Average disk wait time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Number of bytes written to the Windows logical disk per second | cnt |
| Disk Write Time [Avg] | Average data write time (sec) | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Non-paged pool usage of kernel memory | bytes |
| Pool Used [Paged] | Paged pool usage of kernel memory | bytes |
| Process [Running] | Number of processes currently running | cnt |
| Threads [Running] | Number of threads currently running | cnt |
| Threads [Waiting] | Number of threads waiting for processor time | cnt |

Table. Multi-node GPU Cluster [Node] additional monitoring metrics (Agent installation required)
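The memory and per-core CPU definitions in the table can be written out directly. The sketch below uses made-up sample readings to show how the formulas behave; the function names are our own.

```python
# The memory and CPU usage definitions from the table, written out
# (values below are made-up sample readings).

def memory_usage_pct(total: int, free: int) -> float:
    # Memory Usage: ((Memory Total - Memory Free) / Memory Total) * 100
    return (total - free) / total * 100

def memory_usage_actual_pct(total: int, free: int,
                            buffers: int, cached: int) -> float:
    # Memory Usage [Actual]:
    # ((Total - (Free + Buffers + Cached)) / Total) * 100
    return (total - (free + buffers + cached)) / total * 100

def cpu_usage_per_core(unnormalized_pct: float, cores: int) -> float:
    # CPU Usage/Core: the unnormalized rate divided by the number of cores,
    # so 4 fully busy cores read 400% unnormalized but 100% per core
    return unnormalized_pct / cores

total = 16 * 1024**3
free, buffers, cached = 4 * 1024**3, 1 * 1024**3, 3 * 1024**3
print(memory_usage_pct(total, free))                          # → 75.0
print(memory_usage_actual_pct(total, free, buffers, cached))  # → 50.0
print(cpu_usage_per_core(400.0, 4))                           # → 100.0
```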