Multi-node GPU Cluster

1: Overview

1.1: Server type
1.2: Monitoring Metrics

2: How-to guides

2.1: Cluster Fabric Management
2.2: Install ServiceWatch Agent
2.3: Multi-node GPU Cluster Service Scope and Inspection Guide

3: Release Note

1 - Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale high-performance AI computation. You can use two or more Bare Metal Servers equipped with GPUs to cluster multiple GPUs, and conveniently operate GPU servers in conjunction with Samsung Cloud Platform’s high‑performance storage and networking services.

Provided Features

The Multi-node GPU Cluster provides the following features.

Auto Provisioning and Management: Through the web-based Console, you can easily provision servers of the standard GPU Bare Metal model equipped with 8 GPUs and manage resources and costs.
Network Connection: You can cluster multiple GPUs on two or more Bare Metal Servers via high‑speed interconnects, and by configuring a GPU Direct RDMA (Remote Direct Memory Access) environment, you can directly process data I/O between GPU memories, enabling high‑speed AI/Machine Learning computation.
Storage Connection: Provides various additional attached storage besides the OS disk. High-performance SSD NAS File Storage, Block Storage, and Object Storage directly connected over a high-speed network can also be used together.
Network Configuration Management: The server’s subnet/IP can be easily changed from the values set at initial creation. NAT IP can be enabled or disabled as needed.
Monitoring: You can view monitoring information for computing resources such as CPU, GPU, Memory, and Disk through Cloud Monitoring. To use the Cloud Monitoring service with a Multi-node GPU Cluster, you need to install the Agent; install it to ensure stable service. For more details, refer to Multi-node GPU Cluster Monitoring Metrics.
Terraform Provision: Provides an IaC environment via Terraform.

Component

Multi-node GPU Cluster provides GPUs as a Bare Metal Server type with standard images and server types. NVSwitch and NVLink are provided.

Specifications by GPU Type

A GPU (Graphics Processing Unit) is specialized for parallel operations that process large amounts of data quickly, enabling large-scale parallel computation in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of GPU types offered by the Multi-node GPU Cluster service.

Category	H100 Type	B300 Type
GPU Architecture	NVIDIA Hopper	NVIDIA Blackwell Ultra
GPU Memory	80 GiB	268 GiB
GPU Transistors	80 billion 4N TSMC	208 billion 4NP TSMC
FP16 Tensor Core (Dense)	989 TFLOPs	2.25 PFLOPs
FP8 Tensor Core (Dense)	1979 TFLOPs	4.5 PFLOPs
FP4 Tensor Core (Dense)	Not supported	13.5 PFLOPs
GPU Memory Bandwidth	3,352 GB/s HBM3	8 TB/s HBM3e
NVLink performance	NVLink 4	NVLink 5
NVLink Signaling Rate	25 GB/s (x18)	50 GB/s (x18)
NVSwitch GPU-to-GPU bandwidth	900 GB/s	1.8 TB/s
Total NVSwitch aggregate bandwidth	7.2 TB/s	14.4 TB/s

Table. GPU Type specifications
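
The NVLink and NVSwitch rows of the table are related by simple arithmetic, which the following minimal sketch reproduces. It assumes that the signaling rate in the table is per link and per direction, that the GPU-to-GPU figure is bidirectional, and that 1 TB/s equals 1,000 GB/s; the function names are illustrative only.

```python
# Per-link, per-direction NVLink rate (GB/s) and link count, taken from the table above.
NVLINK = {
    "H100": {"rate_gb_s": 25, "links": 18},
    "B300": {"rate_gb_s": 50, "links": 18},
}
GPUS_PER_NODE = 8  # each server type provides 8 GPUs


def gpu_to_gpu_bandwidth_gb_s(gpu_type: str) -> int:
    """Bidirectional NVLink bandwidth of one GPU in GB/s.

    Assumption: the table's signaling rate is per direction,
    so the bidirectional figure is rate * links * 2.
    """
    spec = NVLINK[gpu_type]
    return spec["rate_gb_s"] * spec["links"] * 2


def aggregate_bandwidth_tb_s(gpu_type: str) -> float:
    """Aggregate NVSwitch bandwidth of one 8-GPU node in TB/s (1 TB = 1000 GB)."""
    return gpu_to_gpu_bandwidth_gb_s(gpu_type) * GPUS_PER_NODE / 1000


for gpu in NVLINK:
    print(gpu, gpu_to_gpu_bandwidth_gb_s(gpu), "GB/s per GPU,",
          aggregate_bandwidth_tb_s(gpu), "TB/s per node")
```

Under these assumptions the results match the table: 900 GB/s and 7.2 TB/s for the H100 Type, 1.8 TB/s and 14.4 TB/s for the B300 Type.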

OS and GPU driver version

The operating systems (OS) supported by the Multi-node GPU Cluster are as follows.

OS	OS version	GPU driver version
Ubuntu	22.04	535.86.10, 535.183.06
Ubuntu	24.04	580.105.08

Table. Multi-node GPU Cluster OS and GPU driver version

Server type

The format of server types provided by the Multi-node GPU Cluster is as follows.

Example: when the server type is g2c96h8_metal

Category	example	Detailed description
Server generation	g2	Provided server generation g2: g means GPU server, and 2 means generation
CPU	c96	Number of cores c96: Allocated cores are physical cores
GPU	h8	GPU type and quantity h8: h means GPU type, and 8 means GPU quantity

Table. Multi-node GPU Cluster server type format
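
The naming format above can be split into its parts programmatically. The following is a minimal sketch, not an official parser: the regular expression is inferred from the documented names g2c96h8_metal and g3c128b8_metal, and the `ServerType` and `parse_server_type` names are hypothetical.

```python
import re
from typing import NamedTuple


class ServerType(NamedTuple):
    generation: int  # server generation, e.g. 2 for "g2"
    cores: int       # number of physical cores, e.g. 96 for "c96"
    gpu_type: str    # GPU type letter, e.g. "h"
    gpu_count: int   # number of GPUs, e.g. 8


# Assumption: every server type follows g<generation>c<cores><gpu letter><gpu count>_metal.
_PATTERN = re.compile(r"^g(\d+)c(\d+)([a-z])(\d+)_metal$")


def parse_server_type(name: str) -> ServerType:
    """Split a server type name into generation, cores, GPU type, and GPU count."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized server type: {name!r}")
    generation, cores, gpu_type, gpu_count = match.groups()
    return ServerType(int(generation), int(cores), gpu_type, int(gpu_count))


print(parse_server_type("g2c96h8_metal"))
```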

Reference

For detailed information about the server types provided by Multi-node GPU Cluster, refer to Multi-node GPU Cluster Server Types.

Preceding Service

This is a list of services that must be pre-configured before creating the service. Please refer to the guide provided for each service and prepare in advance.

Service Category	service	Detailed description
Networking	VPC	A service that provides an isolated virtual network in a cloud environment

Table. Multi-node GPU Cluster preliminary service

1.1 - Server type

Multi-node GPU Cluster server type

Multi-node GPU Cluster is categorized based on the GPU Type it provides, and the GPU used in a Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Select the server type based on the specifications of the application you want to run on a multi-node GPU cluster.

The server types supported by the Multi-node GPU Cluster are as follows.

Example: when the server type is g2c96h8_metal.

Category	example	Detailed description
Server generation	g2	Provided server generation g2: g means GPU server, and 2 means generation
CPU	c96	Number of cores c96: Allocated cores are physical cores
GPU	h8	GPU type and quantity h8: h means GPU type, and 8 means GPU quantity

Table. Multi-node GPU Cluster server type format

g2 server type

The g2 server type is a GPU Bare Metal Server that uses NVIDIA H100 SXM GPUs, suitable for large-scale high-performance AI computation.

8 NVIDIA Hopper Architecture-based H100 GPUs provided
Provides 1,979 TFLOPS FP8 Tensor Core and 989 TFLOPS FP16 Tensor Core performance per GPU.
Supports up to 96 vCPUs and 2,048 GB of memory
Supports up to 1,600 Gb/s NVIDIA InfiniBand RDMA network.
Service network up to 100 Gbps
900 GB/s GPU P2P communication via NVSwitch within a node

Server type	GPU	GPU Memory	CPU(Core)	Memory	Disk	GPU P2P
g2c96h8_metal	H100	640 GiB	96 vCore	2 TB	SSD (OS) 960 GB * 2, NVMeSSD 3.84 TB * 4	900 GB/s NVSwitch

Table. Multi-node GPU Cluster server type specifications > H100 server type

g3 server type

The g3 server type is a GPU Bare Metal Server that uses NVIDIA B300 SXM GPUs, suitable not only for large-scale high-performance AI computation but also for LLM inference and AI deployment for generative AI.

8 NVIDIA Blackwell Ultra Architecture-based B300 GPUs provided
Provides 13.5 PFLOPS FP4 Tensor Core and 4.5 PFLOPS FP8 Tensor Core performance per GPU.
Supports up to 128 vCPUs and 4,096 GB of memory
Supports up to 6,400 Gb/s NVIDIA InfiniBand RDMA network
Service network up to 100 Gbps
1.8 TB/s GPU P2P communication via NVSwitch within a node

Server type	GPU	GPU Memory	CPU(Core)	Memory	Disk	GPU P2P
g3c128b8_metal	B300	2.1 TiB	128 vCore	4 TB	SSD (OS) 960 GB * 2, NVMeSSD 3.84 TB * 4	1.8 TB/s NVSwitch

Table. Multi-node GPU Cluster server type specifications > B300 server type

1.2 - Monitoring Metrics

Cloud Monitoring service termination notice

According to Samsung Cloud Platform’s policy, the Cloud Monitoring service is scheduled to be discontinued in September 2026.
Accordingly, after the September 2026 release, resource monitoring of the Samsung Cloud Platform via Cloud Monitoring will no longer be possible.

As an alternative, you can continue resource monitoring with ServiceWatch, which was released in October 2025.
ServiceWatch provides more modern and powerful features, replacing Cloud Monitoring to deliver a seamless monitoring environment.

If you are collecting metrics and logs through the Cloud Monitoring Agent, you need to switch to the ServiceWatch Agent.

For more details about ServiceWatch, refer to ServiceWatch Overview.
For detailed information about the ServiceWatch Agent, refer to ServiceWatch Agent.

Multi-node GPU Cluster Monitoring Metrics

The table below shows the monitoring metrics of a Multi-node GPU Cluster that can be viewed through Cloud Monitoring.

Guide

In a Multi-node GPU Cluster, users must install the Agent themselves by following the guide in order to view monitoring metrics. Be sure to install the Agent to ensure stable service. For instructions on installing the Agent and detailed usage of Cloud Monitoring, refer to the Cloud Monitoring guide.

Multi-node GPU Cluster [Cluster]

Performance items	Detailed description	unit
Memory Total [Basic]	bytes of usable memory	bytes
Memory Used [Basic]	Current memory usage in bytes	bytes
Memory Swap In [Basic]	Bytes of memory swapped in	bytes
Memory Swap Out [Basic]	Bytes of memory swapped out	bytes
Memory Free [Basic]	bytes of unused memory	bytes
Disk Read Bytes [Basic]	Read bytes	bytes
Disk Read Requests [Basic]	Number of read requests	cnt
Disk Write Bytes [Basic]	write bytes	bytes
Disk Write Requests [Basic]	Number of write requests	cnt
CPU Usage [Basic]	Average system CPU usage over 1 minute	%
Instance State [Basic]	Instance status	state
Network In Bytes [Basic]	Received bytes	bytes
Network In Dropped [Basic]	Incoming packet drop	cnt
Network In Packets [Basic]	Number of received packets	cnt
Network Out Bytes [Basic]	sent bytes	bytes
Network Out Dropped [Basic]	Transmit packet drop	cnt
Network Out Packets [Basic]	Number of transmitted packets	cnt

Table. Multi-node GPU Cluster [Cluster] Monitoring Metrics (default)

Performance items	Detailed description	unit
Cluster GPU Count	Total number of GPUs in the cluster: the sum of the GPU counts of all nodes in the same GPU cluster	cnt
Cluster GPU Count In Use	Number of GPUs used by jobs within the cluster: the sum of GPUs occupied by processes, parsed from the ‘Processes:’ section at the bottom of the nvidia-smi output of nodes in the same GPU cluster	cnt
Cluster GPU Usage	Average GPU utilization within the cluster: the average of the GPU utilization values of all nodes in the same GPU cluster	%
Cluster GPU Memory Usage [Avg]	Average GPU memory utilization within the cluster: the average of the memory utilization values of all nodes in the same GPU cluster	%

Table. Multi-node GPU Cluster [Cluster] Additional monitoring metrics (Agent installation required)
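
The cluster-level metrics above are described as sums and averages of per-node values. The following minimal sketch shows that aggregation on hypothetical per-node samples; the dictionary keys and the `cluster_metrics` function are illustrative assumptions and do not reflect the Agent's actual data format.

```python
from statistics import mean


def cluster_metrics(nodes: list[dict]) -> dict:
    """Aggregate per-node GPU samples from one GPU cluster into cluster-level values."""
    return {
        # Sums across all nodes in the cluster
        "cluster_gpu_count": sum(n["gpu_count"] for n in nodes),
        "cluster_gpu_count_in_use": sum(n["gpu_in_use"] for n in nodes),
        # Averages of the per-node utilization percentages
        "cluster_gpu_usage": mean(n["gpu_usage_pct"] for n in nodes),
        "cluster_gpu_memory_usage_avg": mean(n["gpu_memory_usage_pct"] for n in nodes),
    }


# Hypothetical samples for a two-node cluster
samples = [
    {"gpu_count": 8, "gpu_in_use": 8, "gpu_usage_pct": 90.0, "gpu_memory_usage_pct": 70.0},
    {"gpu_count": 8, "gpu_in_use": 4, "gpu_usage_pct": 50.0, "gpu_memory_usage_pct": 30.0},
]
print(cluster_metrics(samples))
```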

Multi-node GPU Cluster [Node]

Performance items	Detailed description	unit
Memory Total [Basic]	bytes of usable memory	bytes
Memory Used [Basic]	Current memory usage in bytes	bytes
Memory Swap In [Basic]	Bytes of memory swapped in	bytes
Memory Swap Out [Basic]	Bytes of memory swapped out	bytes
Memory Free [Basic]	bytes of unused memory	bytes
Disk Read Bytes [Basic]	Read bytes	bytes
Disk Read Requests [Basic]	Number of read requests	cnt
Disk Write Bytes [Basic]	write bytes	bytes
Disk Write Requests [Basic]	Number of write requests	cnt
CPU Usage [Basic]	Average system CPU usage over 1 minute	%
Instance State [Basic]	Instance status	state
Network In Bytes [Basic]	Received bytes	bytes
Network In Dropped [Basic]	Incoming packet drop	cnt
Network In Packets [Basic]	Number of received packets	cnt
Network Out Bytes [Basic]	sent bytes	bytes
Network Out Dropped [Basic]	Transmit packet drop	cnt
Network Out Packets [Basic]	Number of transmitted packets	cnt

Table. Multi-node GPU Cluster [Node] Monitoring Metrics (provided by default)

Performance items	Detailed description	unit
GPU Count	Number of GPUs	cnt
GPU Temperature	GPU temperature	℃
GPU Usage	GPU utilization	%
GPU Usage [Avg]	Overall average GPU utilization (%)	%
GPU Power Cap	Maximum power capacity of the GPU	W
GPU Power Usage	Current GPU power usage	W
GPU Memory Usage [Avg]	Average GPU memory utilization	%
GPU Count in use	Number of GPUs in use by jobs on the node	cnt
Execution Status for nvidia-smi	Result of running the nvidia-smi command	status
Core Usage [IO Wait]	Ratio of CPU time spent in wait state (disk wait)	%
Core Usage [System]	Proportion of CPU time spent in kernel space	%
Core Usage [User]	Proportion of CPU time spent in user space	%
CPU Cores	Number of CPU cores on the host. Unnormalized percentages have a maximum of 100% * the number of cores. Normalized percentages already take this value into account and have a maximum of 100%.	cnt
CPU Usage [Active]	Percentage of CPU time used excluding Idle and IOWait states (400% when all 4 cores are used at 100%)	%
CPU Usage [Idle]	Proportion of CPU time spent in idle state	%
CPU Usage [IO Wait]	Proportion of CPU time spent in a waiting state (disk wait)	%
CPU Usage [System]	Percentage of CPU time used by the kernel (400% when all 4 cores are used at 100%)	%
CPU Usage [User]	Percentage of CPU time used in user space (400% when all 4 cores are used at 100%)	%
CPU Usage/Core [Active]	Percentage of CPU time used excluding Idle and IOWait states (value normalized by the number of cores; 100% when all 4 cores are fully utilized)	%
CPU Usage/Core [Idle]	Proportion of CPU time spent in idle state	%
CPU Usage/Core [IO Wait]	Proportion of CPU time spent in a waiting state (disk wait)	%
CPU Usage/Core [System]	Percentage of CPU time used by the kernel (value normalized by the number of cores; 100% when all 4 cores are fully utilized)	%
CPU Usage/Core [User]	Percentage of CPU time used in user space (value normalized by the number of cores; 100% when all 4 cores are fully utilized)	%
Disk CPU Usage [IO Request]	Proportion of CPU time during which I/O requests for the device were executed (device bandwidth utilization). The device is saturated when this value approaches 100%.	%
Disk Queue Size [Avg]	The average queue length of requests executed for the device.	num
Disk Read Bytes	The number of bytes read per second from the device.	bytes
Disk Read Bytes [Delta Avg]	Average of system.diskio.read.bytes_delta for individual disks	bytes
Disk Read Bytes [Delta Max]	Maximum system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta Min]	Minimum system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta Sum]	Sum of the system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta]	Delta of the system.diskio.read.bytes value for each disk	bytes
Disk Read Bytes [Success]	Total number of bytes successfully read. On Linux, assuming a sector size of 512, it is the number of sectors read multiplied by 512.	bytes
Disk Read Requests	Number of read requests to the disk device per second	cnt
Disk Read Requests [Delta Avg]	Average of the system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Max]	Maximum system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Min]	Minimum of system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Sum]	Sum of system.diskio.read.count_delta of individual disks	cnt
Disk Read Requests [Success Delta]	Delta of system.diskio.read.count for each disk	cnt
Disk Read Requests [Success]	Total number of successful reads	cnt
Disk Request Size [Avg]	Average size of requests executed on the device (unit: sectors).	num
Disk Service Time [Avg]	Average service time (ms) of input requests executed on the device.	ms
Disk Wait Time [Avg]	Average time taken for requests executed on the supported device.	ms
Disk Wait Time [Read]	Average disk wait time	ms
Disk Wait Time [Write]	Average disk wait time	ms
Disk Write Bytes [Delta Avg]	Average of system.diskio.write.bytes_delta for each disk	bytes
Disk Write Bytes [Delta Max]	Maximum system.diskio.write.bytes_delta of individual disks	bytes
Disk Write Bytes [Delta Min]	Minimum of system.diskio.write.bytes_delta for individual disks	bytes
Disk Write Bytes [Delta Sum]	Sum of the system.diskio.write.bytes_delta of individual disks	bytes
Disk Write Bytes [Delta]	Delta of the system.diskio.write.bytes value for each disk	bytes
Disk Write Bytes [Success]	Total number of bytes successfully written. On Linux, assuming a sector size of 512, it is the number of sectors written multiplied by 512.	bytes
Disk Write Requests	Number of write requests to the disk device per second	cnt
Disk Write Requests [Delta Avg]	Average of system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Max]	Maximum system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Min]	Minimum of system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Sum]	Sum of the system.diskio.write.count_delta of individual disks	cnt
Disk Write Requests [Success Delta]	Delta of system.diskio.write.count for each disk	cnt
Disk Write Requests [Success]	Total number of successful writes	cnt
Disk Writes Bytes	Number of bytes written per second to the device	bytes
Filesystem Hang Check	filesystem (local/NFS) hang check (normal:1, abnormal:0)	status
Filesystem Nodes	Total number of file nodes in the file system	cnt
Filesystem Nodes [Free]	Number of available file nodes in the file system	cnt
Filesystem Size [Available]	Disk space (bytes) available to unprivileged users	bytes
Filesystem Size [Free]	Available disk space (bytes)	bytes
Filesystem Size [Total]	Total disk space (bytes)	bytes
Filesystem Usage	Used disk space percentage	%
Filesystem Usage [Avg]	Average of individual filesystem.used.pct	%
Filesystem Usage [Inode]	inode usage	%
Filesystem Usage [Max]	Maximum among individual filesystem.used.pct	%
Filesystem Usage [Min]	minimum of individual filesystem.used.pct	%
Filesystem Usage [Total]	-	%
Filesystem Used	Used disk space (bytes)	bytes
Filesystem Used [Inode]	inode usage	bytes
Memory Free	Total amount of available memory (bytes). Memory used by system cache and buffers is not included (see system.memory.actual.free).	bytes
Memory Free [Actual]	Actual usable memory (bytes). The calculation method varies by OS: on Linux, it is MemAvailable from /proc/meminfo, or if meminfo cannot be used, it is calculated from available memory plus cache and buffers. On OSX, it is the sum of usable memory and inactive memory. On Windows, it corresponds to a value such as system.memory.free.	bytes
Memory Free [Swap]	Available swap memory.	bytes
Memory Total	total memory	bytes
Memory Total [Swap]	Total swap memory.	bytes
Memory Usage	Percentage of used memory: ((Memory Total - Memory Free) / Memory Total) * 100. Memory Free: the amount of memory currently free	%
Memory Usage [Actual]	Percentage of memory actually used: ((Memory Total - Memory Available) / Memory Total) * 100, or ((Memory Total - (Memory Free + Buffers + Cached)) / Memory Total) * 100. Memory Free: the amount of memory currently free. Buffers: the amount of memory used for buffers. Cached: the amount of memory used for the page cache	%
Memory Usage [Cache Swap]	Cached swap usage rate	%
Memory Usage [Swap]	Percentage of used swap memory	%
Memory Used	used memory	bytes
Memory Used [Actual]	Actual used memory (bytes). The value obtained by subtracting available memory from total memory. Available memory is calculated differently for each OS (see system.actual.free).	bytes
Memory Used [Swap]	Used swap memory.	bytes
Collisions	Network collision	cnt
Network In Bytes	Number of received bytes	bytes
Network In Bytes [Delta Avg]	Average of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Max]	Maximum of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Min]	Minimum system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Sum]	Sum of system.network.in.bytes_delta for individual networks	bytes
Network In Bytes [Delta]	Delta of received byte count	bytes
Network In Dropped	Number of dropped packets among incoming packets	cnt
Network In Errors	Number of errors during reception	cnt
Network In Packets	Number of received packets	cnt
Network In Packets [Delta Avg]	Average of system.network.in.packets_delta for individual networks	cnt
Network In Packets [Delta Max]	Maximum of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Min]	Minimum of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Sum]	Sum of system.network.in.packets_delta for individual networks	cnt
Network In Packets [Delta]	Delta of received packet count	cnt
Network Out Bytes	Number of transmitted bytes	bytes
Network Out Bytes [Delta Avg]	Average of system.network.out.bytes_delta for each network	bytes
Network Out Bytes [Delta Max]	Maximum system.network.out.bytes_delta of individual networks	bytes
Network Out Bytes [Delta Min]	Minimum of system.network.out.bytes_delta for individual networks	bytes
Network Out Bytes [Delta Sum]	Sum of system.network.out.bytes_delta for individual networks	bytes
Network Out Bytes [Delta]	Delta of transmitted byte count	bytes
Network Out Dropped	Number of dropped packets among outgoing packets. This value is not reported by the operating system, so it is always 0 on Darwin and BSD.	cnt
Network Out Errors	Number of errors during transmission	cnt
Network Out Packets	Number of transmitted packets	cnt
Network Out Packets [Delta Avg]	Average of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Max]	Maximum of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Min]	Minimum of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Sum]	Sum of system.network.out.packets_delta for individual networks	cnt
Network Out Packets [Delta]	Delta of transmitted packet count	cnt
Open Connections [TCP]	All open TCP connections	cnt
Open Connections [UDP]	All open UDP connections	cnt
Port Usage	Available port usage rate	%
SYN Sent Sockets	Number of sockets in SYN_SENT state (when connecting from local to remote)	cnt
Kernel PID Max	kernel.pid_max value	cnt
Kernel Thread Max	kernel.threads-max value	cnt
Process CPU Usage	Percentage of CPU time consumed by the process since the last update. This value is similar to the %CPU value shown for the process by the top command on Unix systems.	%
Process CPU Usage/Core	The percentage of CPU time used by the process since the last event. Normalized by the number of cores, with values ranging from 0 to 100%.	%
Process Memory Usage	Proportion of main memory (RAM) occupied by a process	%
Process Memory Used	Resident Set size. The amount of memory a process occupies in RAM. In Windows, the current working set size.	bytes
Process PID	process pid	PID
Process PPID	parent process PID	PID
Processes [Dead]	Number of dead processes	cnt
Processes [Idle]	Number of idle processes	cnt
Processes [Running]	Number of running processes	cnt
Processes [Sleeping]	Number of sleeping processes	cnt
Processes [Stopped]	stopped processes count	cnt
Processes [Total]	Total number of processes	cnt
Processes [Unknown]	Number of processes with an unknown or unsearchable status	cnt
Processes [Zombie]	Number of zombie processes	cnt
Running Process Usage	process usage	%
Running Processes	Number of running processes	cnt
Running Thread Usage	Thread usage rate	%
Running Threads	Total number of threads running in running processes	cnt
Instance Status	Instance status	state
Context Switches	context switch count (per second)	cnt
Load/Core [1 min]	The load over the last 1 minute divided by the number of cores	cnt
Load/Core [15 min]	The load over the last 15 minutes divided by the number of cores	cnt
Load/Core [5 min]	The load over the last 5 minutes divided by the number of cores	cnt
Multipaths [Active]	External storage connection path status = active count	cnt
Multipaths [Failed]	External storage connection path status = failed count	cnt
Multipaths [Faulty]	External storage connection path status = faulty count	cnt
NTP Offset	measured offset of the last sample (the time difference between the NTP server and the local environment)	num
Run Queue Length	Execution queue length	num
Uptime	OS uptime (milliseconds)	ms
Context Switchies	CPU context switch count (per second)	cnt
Disk Read Bytes [Sec]	Number of bytes read from a Windows logical disk in 1 second	cnt
Disk Read Time [Avg]	Average data read time (seconds)	sec
Disk Transfer Time [Avg]	Disk average wait time	sec
Disk Usage	Disk usage	%
Disk Write Bytes [Sec]	Number of bytes written in one second on a Windows logical disk	cnt
Disk Write Time [Avg]	Average data write time (seconds)	sec
Pagingfile Usage	Paging file usage	%
Pool Used [Non Paged]	Nonpaged Pool usage in kernel memory	bytes
Pool Used [Paged]	Paged Pool usage in kernel memory	bytes
Process [Running]	Number of currently running processes	cnt
Threads [Running]	Number of currently running threads	cnt
Threads [Waiting]	Number of threads waiting for processor time	cnt

Table. Multi-node GPU Cluster [Node] Additional monitoring metrics (Agent installation required)
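
Two relationships in the table above lend themselves to a short worked example: the CPU Usage/Core metrics are the CPU Usage metrics divided by the number of cores, and the Memory Usage metrics follow the formulas given in their descriptions. The following is a minimal sketch of that arithmetic; the function names and the sample values are illustrative assumptions, not part of the Agent.

```python
def normalize_cpu_pct(unnormalized_pct: float, cores: int) -> float:
    """Convert an unnormalized CPU percentage (max 100% * cores) to a per-core value (max 100%)."""
    return unnormalized_pct / cores


def memory_usage_pct(total: int, free: int) -> float:
    """Memory Usage: ((Memory Total - Memory Free) / Memory Total) * 100."""
    return (total - free) / total * 100


def memory_usage_actual_pct(total: int, free: int, buffers: int, cached: int) -> float:
    """Memory Usage [Actual], using the Free + Buffers + Cached form of the formula."""
    return (total - (free + buffers + cached)) / total * 100


# Four fully used cores report 400% unnormalized, which is 100% per core.
print(normalize_cpu_pct(400.0, 4))
# Hypothetical byte counts chosen for readability.
print(memory_usage_pct(1000, 250))
print(memory_usage_actual_pct(1000, 250, 50, 200))
```

The gap between the two memory figures comes from buffers and page cache, which count as used in Memory Usage but as reclaimable in Memory Usage [Actual].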

2 - How-to guides

Users can create the service by entering the required information for the Multi-node GPU Cluster service and selecting detailed options through the Samsung Cloud Platform Console.

Multi-node GPU Cluster Getting Started

You can create and use a Multi-node GPU Cluster service from the Samsung Cloud Platform Console.

This service consists of a GPU Node and a Cluster Fabric service.

Create GPU Node

Follow the steps below to create a Multi-node GPU Cluster.

Click the All Services > Compute > Multi-node GPU Cluster menu. The Service Home page of the Multi-node GPU Cluster opens.
On the Service Home page, click the Create GPU Node button. The GPU Node Creation page opens.

On the GPU Node creation page, enter the information required to create the service, and select detailed options.

Select the required information in the Image and Version Selection area.

Category	required status	Detailed description
image	Required	Select the type of image provided: Ubuntu
Image version	Required	Select the version of the selected image from the list of provided server image versions

Table. GPU Node image and version selection options

Enter or select the required information in the Service Information Input area.

Category	required status	Detailed description
Number of servers	Required	Number of GPU Node servers to create simultaneously. Only numeric input is allowed, and the minimum is 2. The initial configuration must create at least 2 servers; later expansions can add one server at a time.
Service Type > Server Type	Required	GPU Node server type. Select the desired CPU, Memory, GPU, and Disk specifications. For detailed information about the server types provided by GPU Node, refer to Multi-node GPU Cluster server type.
Service Type > Planned Compute	Required	Status of resources with Planned Compute configured. In Use: number of resources with Planned Compute that are currently in use. Configured: number of resources with Planned Compute set. Coverage Preview: amount applied per resource by Planned Compute. Apply for Planned Compute Service: go to the Planned Compute service application page. For more details, see Apply for Planned Compute.

Table. GPU Node Service Information Input Items

In the Required Information Input area, enter or select the required information.

Category	required status	Detailed description
Administrator account	Required	Set the administrator account and password to be used when connecting to the server. Ubuntu OS is provided with a fixed root account.
Server name Prefix	Required	Enter a prefix to distinguish each GPU Node generated when the selected number of servers is 2 or more. Server names are generated automatically in the form of user input value (prefix) + ‘`-###`’. The prefix must start with a lowercase English letter, use only lowercase letters, numbers, and the special character (`-`), and be 3 to 11 characters long. It must not end with the special character (`-`).
Network Settings	Required	Set the network where the GPU Node will be installed. VPC Name: select a pre‑created VPC. General Subnet Name: select a pre‑created standard Subnet; the IP can be Auto‑generated or Manual Input, and if Manual Input is selected, the user enters the IP directly. NAT: available only when there is a single server and the VPC is attached to an Internet Gateway; checking the option allows selection of a NAT IP. (Because the initial configuration requires two or more servers, set this on the resource detail page after creation.) NAT IP: select a NAT IP. If no NAT IP is available to select, click the Create New button to generate a Public IP, then click the Refresh button to view and select the created Public IP. Creating a Public IP incurs charges according to the Public IP pricing policy.

Table. GPU Node required information input items
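
The Server name Prefix rules in the table above can be checked before submitting the form. The following is a minimal sketch of such a check. It assumes that ‘-###’ stands for a zero-padded three-digit index starting at 001, which the source does not specify; `is_valid_prefix` and `node_names` are hypothetical helpers, not part of the Console.

```python
import re

# 3 to 11 characters: starts with a lowercase letter, continues with lowercase
# letters, digits, or '-', and ends with a lowercase letter or digit (not '-').
_PREFIX_RE = re.compile(r"^[a-z][a-z0-9-]{1,9}[a-z0-9]$")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the documented naming rules."""
    return _PREFIX_RE.fullmatch(prefix) is not None


def node_names(prefix: str, count: int) -> list[str]:
    """Build names of the form prefix + '-###'.

    Assumption: '###' is a zero-padded index starting at 001.
    """
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid server name prefix: {prefix!r}")
    return [f"{prefix}-{i:03d}" for i in range(1, count + 1)]


print(is_valid_prefix("gpu-node"))
print(is_valid_prefix("gpu-"))
print(node_names("gpu-node", 2))
```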

In the Cluster Selection area, create or select a Cluster Fabric.

Category	required status	Detailed description
Cluster Fabric	Required	A group of GPU Node servers to which GPU Direct RDMA can be applied together. Optimal GPU performance and speed can be secured only within the same Cluster Fabric. To create a new Cluster Fabric, select New Input, select a Node pool, and then enter the name of the Cluster Fabric to be created. To add to an existing Cluster Fabric, select Existing Input, select a Node pool, and then choose the previously created Cluster Fabric.

Table. GPU Node Cluster Fabric options

In the Additional Information Input area, enter or select the required information.

Category	Required status	Detailed description
Lock	Optional	Using Lock prevents accidental actions such as terminating, starting, or stopping the server.
Init Script	Optional	Script to run at server startup. The Init Script type must be selected according to the image type. For Linux: choose Shell Script or cloud-init.
Tag	Optional	Add tags. Up to 50 tags can be added per resource. Click the Add Tag button, then enter or select the Key and Value.

Table. GPU Node additional information input fields

In the Summary panel, check the generated detailed information and estimated billing amount, and click the Create button.
When the popup notifying creation opens, click the Confirm button.
- When creation is complete, check the created resources on the GPU Node List page.

Caution

When a service is created, the GPU MIG/ECC settings are reset. However, to make sure the correct settings are applied, perform an initial reboot and verify that the settings have been applied before using the node.
For detailed information on resetting GPU MIG/ECC settings, refer to the GPU MIG/ECC Settings Reset Inspection Guide.

Check GPU Node detailed information

In the Multi-node GPU Cluster service, you can view the full list of GPU Node resources and view or modify their detailed information.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Node, follow these steps.

1. Click the All Services > Compute > Multi-node GPU Cluster > GPU Node menu to go to the Service Home page of the Multi-node GPU Cluster.

2. On the Service Home page, click the GPU Node menu to go to the GPU Node List page.

Resource items other than the required columns can be added through the Settings button.

Category	Required status	Detailed description
Resource ID	Optional	User-created GPU Node ID
Cluster Fabric name	Required	User-created Cluster Fabric name
Server name	Required	User-created GPU Node name
Server type	Required	Server type of the GPU Node. You can view the number of cores, memory capacity, and GPU type and count of the created resource
Image	Required	Image version of the user-created GPU Node
IP	Required	IP of the user-created GPU Node
Status	Required	Status of the user-created GPU Node
Creation date and time	Optional	GPU Node creation date and time

Table. GPU Node resource list items

3. On the GPU Node List page, click the resource whose details you want to view to go to the GPU Node Details page.

At the top of the GPU Node Details page, status information and additional features are displayed.

Category	Detailed description
GPU Node status	Status of the GPU Node created by the user. Creating: the server is being created. Running: creation is complete and the server is available for use. Editing: the IP is being changed. Unknown: error state. Starting: the server is starting. Stopping: the server is stopping. Stopped: the server has stopped. Terminating: the server is being terminated. Terminated: termination is complete.
Server control	Buttons to change the server status. Start: start a stopped server. Stop: stop a running server.
Service cancellation	Button to cancel the service.

Table. GPU Node status information and additional features
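As a rough illustration of the Server control row in the table above, the following sketch maps a GPU Node status to the control action that applies to it. The function name allowed_control_action is hypothetical, and treating every status other than Stopped and Running as having no available action is an assumption rather than documented console behavior.

```shell
# allowed_control_action STATUS
# Prints the server-control action that applies to a GPU Node status.
# Only "Start a stopped server" and "Stop a running server" come from the
# table; returning "None" for all other statuses is an assumption.
allowed_control_action() {
  case "$1" in
    Stopped) echo "Start" ;;
    Running) echo "Stop" ;;
    *)       echo "None" ;;
  esac
}
```

For example, calling allowed_control_action Stopped prints Start, which matches the rule that only a stopped server can be started.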

Detailed Information

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit it if needed.

Category	Detailed description
Service	Service name
Resource Type	Resource type
SRN	Unique resource ID in Samsung Cloud Platform. For a GPU Node, this is the GPU Node SRN
Resource name	Resource name. In the GPU Node service, this is the GPU Node name
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation date and time	Date and time the service was created
Modifier	User who modified the service information
Modification date and time	Date and time the service information was modified
Server name	Server name
Node pool	A collection of nodes that can be grouped into the same Cluster Fabric
Cluster Fabric name	User-created Cluster Fabric name
Image/Version	Server OS image and version
Server type	Displays CPU, memory, and GPU information
Planned Compute	Status of resources for which Planned Compute is configured. For more details, refer to Apply for Planned Compute
Lock	Indicates whether Lock is enabled or disabled. When Lock is enabled, server termination, start, and stop actions are blocked to prevent accidental operations. To change the Lock setting, click the Edit button
Network	GPU Node network information: VPC name, General Subnet name, IP, Public NAT IP, and status
Block Storage	Information on the Block Storage connected to the server: volume name, disk type, capacity, and status
Init Script	View the Init Script content entered during server creation

Table. GPU Node detailed information tab items

Note

If the VPC does not have an Internet Gateway attached, you cannot attach a Public NAT IP.

Tags

Category	Detailed description
Tag list	List of tags. You can view the Key and Value of each tag. Up to 50 tags can be added per resource. When entering a tag, you can search for and select from previously created Keys and Values

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category	Detailed description
Job history list	Resource change history. You can check the operation details, operation time, resource type, resource name, event topic, operation result, and operator information. The Detailed Search button provides detailed search functionality

Table. GPU Node Job History Tab Detailed Information Items

Control GPU Node operation

If you need to control or manage the created GPU Node resources, you can do so on the GPU Node List page or the GPU Node Details page. You can start a stopped GPU Node or stop a running one.

Start GPU Node

You can start a stopped (Stopped) GPU Node. To start the GPU Node, follow these steps.

1. Click the All Services > Compute > Multi-node GPU Cluster menu to go to the Service Home page of the Multi-node GPU Cluster.
2. On the Service Home page, click the GPU Node menu to go to the GPU Node List page.
- On the GPU Node List page, you can select one or more servers with the checkboxes and start them via the More button at the top.
3. On the GPU Node List page, click the resource to go to the GPU Node Details page.
- On the GPU Node Details page, click the Start button at the top to start the server.
4. Check the server status and complete the status change.

Stop GPU Node

You can stop a running (Running) GPU Node. To stop the GPU Node, follow the steps below.

1. Click the All Services > Compute > Multi-node GPU Cluster menu to go to the Service Home page of the Multi-node GPU Cluster.
2. On the Service Home page, click the GPU Node menu to go to the GPU Node List page.
- On the GPU Node List page, you can select one or more servers with the checkboxes and stop them using the Stop button at the top.
3. On the GPU Node List page, click the resource to go to the GPU Node Details page.
- On the GPU Node Details page, click the Stop button at the top to stop the server.
4. Check the server status and complete the status change.

Terminate GPU Node

You can terminate unused GPU nodes to reduce operating costs. However, if you terminate the service, the running service may be discontinued immediately, so you should proceed with the termination only after fully considering the impact that may arise from the service interruption.

Caution

Please be aware that data cannot be recovered after terminating the service.

To terminate a GPU Node, follow these steps.

1. Click the All Services > Compute > Multi-node GPU Cluster menu to go to the Service Home page of the Multi-node GPU Cluster.
2. On the Service Home page, click the Cluster Fabric menu to go to the Cluster Fabric List page.
3. On the Cluster Fabric List page, select the resources to terminate, and click the Cancel Service button.
- Resources that use the same Cluster Fabric can be terminated simultaneously.
4. When termination is complete, check on the GPU Node List page whether the resources have been terminated.

Notice

The cases where a GPU Node cannot be terminated are as follows.

When Block Storage(BM) is connected: Please disconnect the Block Storage(BM) connection first.
When File Storage is connected: Please disconnect the File Storage connection first.
If Lock is set: Please change the Lock setting to disabled and try again.
If the selection includes a server that cannot be terminated simultaneously: Please re-select only resources that can be terminated.
If the servers you want to terminate belong to different Cluster Fabrics: Select only resources that use the same Cluster Fabric.
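
The per-node conditions in the list above can be sketched as a simple pre-check. This is only an illustration: the function name termination_blockers and its yes/no arguments are hypothetical, and the values would have to be read manually from the GPU Node Details page because no console API is assumed here.

```shell
# termination_blockers BLOCK_STORAGE FILE_STORAGE LOCK
# Each argument is "yes" or "no" (hypothetical inputs taken from the
# GPU Node Details page). Prints one line per condition that blocks
# termination and prints nothing when none of them applies.
termination_blockers() {
  block_storage="$1"
  file_storage="$2"
  lock="$3"
  [ "$block_storage" = "yes" ] && echo "Disconnect Block Storage(BM) first"
  [ "$file_storage" = "yes" ] && echo "Disconnect File Storage first"
  [ "$lock" = "yes" ] && echo "Disable Lock first"
  return 0
}
```

The sketch covers only the storage and Lock conditions; the Cluster Fabric conditions depend on which resources are selected together and are not modeled.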

Reference

If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.

2.1 - Cluster Fabric Management

Cluster Fabric is a service that helps manage the servers (GPU Nodes) included in a GPU Cluster. Using Cluster Fabric, you can move servers between GPU clusters in the same node pool and optimize GPU performance and speed within a single GPU cluster.

Create Cluster Fabric

A Cluster Fabric is created together with a GPU Node; it cannot be created or deleted separately. If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
If you have not created a GPU Node, create one first. For more details, refer to Create GPU Node.

Check detailed information of Cluster Fabric

Guide

A Cluster Fabric is created together with a GPU Node; it cannot be created or deleted independently.
If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
If you have not created a GPU Node, create one first. For more details, refer to Create GPU Node.

On the Cluster Fabric List page and the Cluster Fabric Details page, you can view the list and details of the created Cluster Fabrics and move servers.

1. Click the All Services > Compute > Multi-node GPU Cluster menu to go to the Service Home page of the Multi-node GPU Cluster.

2. On the Service Home page, click the Cluster Fabric menu to go to the Cluster Fabric List page.

- On the Cluster Fabric List page, you can view the resource list of the GPU Cluster you created.

- Resource items other than the required columns can be added through the Settings button.

Category	Required status	Detailed description
Resource ID	Optional	User-created Cluster Fabric ID
Cluster Fabric name	Required	User-created Cluster Fabric name
Node pool	Optional	A collection of nodes that can be grouped into the same Cluster Fabric
Number of servers	Optional	Number of GPU Nodes
Server type	Optional	Server type of the GPU Node. You can view the number of cores, memory capacity, and GPU type and count of the created resource
Status	Optional	Status of the user-created Cluster Fabric
Creation date and time	Optional	Cluster Fabric creation date and time

Table. Cluster Fabric resource list items

3. On the Cluster Fabric List page, click the resource whose details you want to view to go to the Cluster Fabric Details page.

At the top of the Cluster Fabric Details page, status information and additional features are displayed.

Category	Detailed description
Cluster Fabric status	Status of the Cluster Fabric created by the user. Creating: the cluster is being created. Active: creation is complete and the cluster is usable. Editing: the IP is being changed. Deleting: the cluster is being terminated. Deleted: termination is complete.
Add target server	Moves a server from another cluster to the target cluster.

Table. Cluster Fabric status information and additional functions

Detailed Information

On the Details tab of the Cluster Fabric Details page, you can view the detailed information of the selected resource and import servers from another cluster.

Category	Detailed description
Service	Service name
Resource Type	Resource type
SRN	Unique resource ID in Samsung Cloud Platform. For a Cluster Fabric, this is the Cluster Fabric SRN
Resource name	Resource name. In the Cluster Fabric service, this is the Cluster Fabric name
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation date and time	Date and time the service was created
Modifier	User who modified the service information
Modification date and time	Date and time the service information was modified
Cluster Fabric name	User-created Cluster Fabric name
Node pool	A collection of nodes that can be grouped into the same Cluster Fabric
Target server	List of GPU Nodes bound to the Cluster Fabric: server name, server type, IP, and status

Table. Cluster Fabric detailed information tab items

Import Cluster Fabric server

Using the Add Target Server feature on the Cluster Fabric Details page, you can import servers from another cluster and add them to the selected cluster.

1. Click the All Services > Compute > Multi-node GPU Cluster menu to go to the Service Home page of the Multi-node GPU Cluster.
2. On the Service Home page, click the Cluster Fabric menu to go to the Cluster Fabric List page.
3. On the Cluster Fabric List page, click the resource whose details you want to view to go to the Cluster Fabric Details page.
4. On the Details tab, click the Add button to the right of Target server.
- The popup window for adding a target server opens.
  - Select a cluster in Cluster Fabric.
  - The GPU Nodes associated with the selected cluster are listed; select the GPU Node you want to import.
  - The name of the selected GPU Node is displayed at the bottom.
  - Click the Confirm button to complete the operation.
  - Click the Cancel button to cancel the operation.
- Check whether the added GPU Node appears in Target server.

Terminate Cluster Fabric

If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more details, see Terminate GPU Node.

2.2 - Install ServiceWatch Agent

Users can install the ServiceWatch Agent on the GPU nodes of a Multi-node GPU Cluster to collect custom metrics and logs.

Reference

Custom metric/log collection via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will also be available in other offerings in the future.

Caution

Since metric collection through the ServiceWatch Agent is classified as a custom metric and incurs charges unlike the default metrics, it is recommended to remove or disable any unnecessary metric collection settings.

ServiceWatch Agent

In a Multi-node GPU Cluster, the agents that must be installed on GPU nodes to collect ServiceWatch custom metrics and logs fall into two main types: the Prometheus Exporter and the OpenTelemetry Collector.

Category	Detailed description
Prometheus Exporter	Provides metrics of a specific application or service in a format that Prometheus can scrape. To collect OS metrics on a GPU Node, you can use the Node Exporter for Linux servers or the Windows Exporter for Windows servers, depending on the OS type. On a GPU Node, the Node Exporter can be used for OS metric collection just as on a Virtual Server; see Virtual Server > ServiceWatch Agent for details. For GPU metrics, you can use the DCGM (NVIDIA Data Center GPU Manager) Exporter; see GPU Server > ServiceWatch Agent Installation for how to use it.
OpenTelemetry Collector	Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to various backends (e.g., Prometheus, Jaeger, Elasticsearch). Exports data via the ServiceWatch Gateway so that ServiceWatch can collect metric and log data. For instructions on using the OpenTelemetry Collector provided by ServiceWatch, see ServiceWatch > Using ServiceWatch Agent.

Table. Explanation of Prometheus Exporter and OpenTelemetry Collector
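
To make "a format that Prometheus can scrape" more concrete, the following sketch reads Prometheus text exposition output and prints the sample values of one metric. The function name metric_values is hypothetical. The sketch assumes samples carry no trailing timestamp, so the last field on each line is the value.

```shell
# metric_values NAME
# Reads Prometheus text exposition format on stdin and prints the value of
# every sample whose metric name equals NAME. Comment lines (# HELP, # TYPE)
# are skipped. Assumes samples have no trailing timestamp field.
metric_values() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/[{].*/, "", metric)   # drop the {label="..."} part, if any
      if (metric == name) print $NF
    }'
}
```

As a usage example, assuming a DCGM Exporter is listening on its default port 9400, running curl -s http://localhost:9400/metrics | metric_values DCGM_FI_DEV_GPU_UTIL would print one utilization value per GPU. The port and metric name depend on how the exporter is configured.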

Information

If you have configured a Kubernetes Engine on a GPU node, please view the GPU metrics using the metrics provided by the Kubernetes Engine.

If you install the DCGM Exporter on a GPU node configured with Kubernetes Engine, it may not operate correctly.

Reference

The ServiceWatch Agent guide for collecting GPU metrics on a GPU Node can be used the same way as on a GPU Server. For more details, see GPU Server > ServiceWatch Agent.

Pre-configuration for Using ServiceWatch Agent

To use the ServiceWatch Agent, refer to Pre-configuration for ServiceWatch Agent and complete the prerequisite configuration.

2.3 - Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster Service Scope

If an IaaS hardware-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, the risks associated with changes such as OS kernel updates or application installations are the user’s responsibility, so technical support is limited; please be mindful when performing tasks such as system updates.

IaaS hardware level issue

HW fault event messages generated within the server by the IPMI hardware monitoring console.
GPU HW operation errors observed in the output of the nvidia-smi command
HW error messages that occur during inspection of InfiniBand HCA cards or InfiniBand Switches

Caution

Since the Multi-node GPU Cluster is a service sensitive to software version compatibility among the Ubuntu OS, NVIDIA, and InfiniBand components, official technical support is unavailable after changes such as a user’s OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.

Intel E810 driver update

Check the version of the Intel E810 driver and, if necessary, refer to the following procedure to perform an update.

Server manufacturer Intel E810 driver minimum recommended version: 1.15.4 or later
Driver download: Intel Network Adapter Driver for 800 Series Devices under Linux

Caution

Proceed with the update only if the Multi-node GPU Cluster Node has an Intel E810 Device and is simultaneously using a standard image version 535.86.10 or lower.

Reference

You can use the lspci command to verify whether an E810 NIC device is present. If an E810 NIC is installed, the PCIe devices are listed with the E810-C identifier as shown below. If there is no E810 NIC device, do not perform this update.


lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)

Example of checking for the presence of an E810 NIC device
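
If the check needs to be scripted, the lspci output can be counted rather than read by eye. The following is a minimal sketch; the function name count_e810_devices is hypothetical, and it matches any line containing the string E810.

```shell
# count_e810_devices
# Reads `lspci` output on stdin and prints how many lines mention an E810
# controller. A result of 0 means no E810 NIC was found.
count_e810_devices() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c 'E810' || true
}
```

For example, lspci | count_e810_devices prints 2 for the output shown above.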

To update the driver, follow the steps below.

1. Move the base driver tar file to the desired directory.
  /usr/local/src
  Directory example
2. Untar/unzip the archive file.
- x.x.x is the version number of the driver tar file.
  tar zxf ice-x.x.x.tar.gz
  Untar example
3. Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
  cd ice-x.x.x/src/
  Example of changing to the src directory
4. Compile and install the driver module.
  make install
  Driver module compilation example
5. After the update is complete, check the version.
  lsmod | grep ice
  modinfo ice | grep version
  Version check example
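
To compare the installed version with the minimum recommended version (1.15.4), a small helper can be used. The following is a sketch under stated assumptions: the function names ice_version and version_at_least are hypothetical, and the comparison relies on sort -V (version sort) as provided by GNU coreutils.

```shell
# ice_version
# Reads `modinfo ice` output on stdin and prints the value of the
# "version:" field only (grep version would also match srcversion/vermagic).
ice_version() {
  awk '$1 == "version:" { print $2; exit }'
}

# version_at_least INSTALLED MINIMUM
# Succeeds (exit status 0) when INSTALLED is greater than or equal to MINIMUM.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}
```

For example, version_at_least "$(modinfo ice | ice_version)" 1.15.4 succeeds when the loaded driver already meets the recommended minimum, in which case the update can be skipped.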

Check NVIDIA driver

To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS hardware level, follow these steps.

Reference

The example of applying MIG is described using an A100 GPU node as the reference.

Check the GPU driver status.

~$ nvidia-smi

Example code for checking GPU driver status

Thu Jan 29 14:48:31 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
| N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
| N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
| N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

GPU driver status example

Check the NVSwitch and NVLink hardware status.
- Check NVSwitch status
  ~$ nvidia-smi nvlink --status
  NVSwitch status check example

GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s

NVSwitch status example
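
The link listing above can also be checked programmatically. The following is a minimal sketch, assuming the text of `nvidia-smi nvlink --status` has already been captured into a string; the function names `parse_nvlink_status` and `unhealthy_gpus` are hypothetical helpers, and the expected values (12 links at 25 GB/s) are taken from the A100 example above rather than from any specification.

```python
import re


def parse_nvlink_status(text):
    """Map each GPU index to the list of link speeds (GB/s) reported for it."""
    gpus, current = {}, None
    # A "GPU <n>:" token opens a new GPU; each "Link <n>: <x> GB/s" token belongs to it.
    for m in re.finditer(r"GPU (\d+):|Link \d+: ([\d.]+) GB/s", text):
        if m.group(1) is not None:
            current = int(m.group(1))
            gpus[current] = []
        elif current is not None:
            gpus[current].append(float(m.group(2)))
    return gpus


def unhealthy_gpus(gpus, links=12, speed=25.0):
    """Return GPU indexes that do not report `links` links, all at `speed` GB/s."""
    return sorted(
        idx for idx, speeds in gpus.items()
        if len(speeds) != links or any(s != speed for s in speeds)
    )
```

A link that does not report a GB/s figure is simply not counted, so the affected GPU shows up in `unhealthy_gpus` with fewer links than expected.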

Check NVLink hardware status

~$ nvidia-smi topo -m

Example code for checking NVLink hardware status

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVLink HW status check code example
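
To confirm that every GPU pair is connected over NVLink rather than PCIe, the matrix can be scanned for cells other than `X` and `NV12`. This is a rough sketch, assuming the `nvidia-smi topo -m` output is available as plain text; `parse_topo_matrix` and `non_nvlink_pairs` are hypothetical names, and `NV12` is the value seen in the example above, not a universal expectation.

```python
import re


def parse_topo_matrix(text):
    """Return {row_gpu: [cells]} for the GPU-to-GPU part of the topology matrix."""
    # Data rows start with "GPU<n>" at column 0; the header row is indented.
    rows = [line.split() for line in text.splitlines() if re.match(r"GPU\d+\s", line)]
    n = len(rows)  # the first n cells of each row are the GPU columns
    return {row[0]: row[1:1 + n] for row in rows}


def non_nvlink_pairs(matrix, expected="NV12"):
    """List (row_gpu, col_gpu) pairs whose cell is neither 'X' nor `expected`."""
    names = list(matrix)
    return [
        (row, names[col])
        for row, cells in matrix.items()
        for col, cell in enumerate(cells)
        if cell not in ("X", expected)
    ]
```

An empty list from `non_nvlink_pairs` corresponds to the healthy matrix shown in the example; any `SYS`, `NODE`, `PHB`, `PXB`, or `PIX` cell is reported with the GPU pair it belongs to.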

Check the InfiniBand (IB) HCA card hardware status and link.

user@bm-dev-001:~$ ibdev2netdev -v

HW status check command example

cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$

Example of HW status check result

root@bm-dev-001:~# ibstat

Example of link verification command

CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff5060ac
        System image GUID: 0x88e9a4ffff5060ac
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 8
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff5060ac
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504080
        System image GUID: 0x88e9a4ffff504080
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504080
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff505038
        System image GUID: 0x88e9a4ffff505038
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff505038
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504094
        System image GUID: 0x88e9a4ffff504094
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 7
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504094
                Link layer: InfiniBand

Link verification result example
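
The per-port fields in the `ibstat` output lend themselves to an automated check. The sketch below assumes single-port HCAs, as in the example, and that the output has been captured as text; `parse_ibstat` and `ports_not_ready` are hypothetical helper names, and the expected rate of 200 is simply the value shown above.

```python
import re


def parse_ibstat(text):
    """Return {ca_name: {field: value}}; port fields are flattened into the CA entry."""
    cas, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"CA '([^']+)'", line)
        if header:
            current = cas.setdefault(header.group(1), {})
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return cas


def ports_not_ready(cas, rate="200"):
    """Return CA names whose port is not Active/LinkUp at the expected rate."""
    return sorted(
        name for name, fields in cas.items()
        if fields.get("State") != "Active"
        or fields.get("Physical state") != "LinkUp"
        or fields.get("Rate") != rate
    )
```

For the example output, `ports_not_ready(parse_ibstat(text))` would be empty, since all four CAs report `State: Active`, `Physical state: LinkUp`, and `Rate: 200`.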

Check IB bandwidth communication

Use ib_send_bw to check the IB bandwidth communication status between two nodes as part of the IaaS hardware-level inspection. Follow the steps below.

Check the name of the IB HCA interface.

In the following example, the ACTIVE ConnectX-7 IB ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8. (mlx5_1 and mlx5_6 are ConnectX-6 adapters bound to bond-nas.)

~$ ibdev2netdev -v

Example of checking IB HCA interface name

0000:1a:00.0 mlx5_0 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp60s0 (Down)
0000:4d:00.0 mlx5_3 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp77s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
0000:9c:00.0 mlx5_5 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
0000:9d:00.0 mlx5_6 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:bc:00.0 mlx5_7 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp188s0 (Down)
0000:cc:00.0 mlx5_8 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
0000:dc:00.0 mlx5_9 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp220s0 (Down)

Example of IB HCA interface name verification result
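
Picking the device names to pass to `ib_send_bw -d` can be scripted from this listing. The following is a sketch only, assuming the `ibdev2netdev -v` line layout shown above; `active_fabric_devices` is a hypothetical helper, and excluding devices mapped to `bond-nas` is an assumption based on the example, where those adapters serve the NAS bond rather than the cluster fabric.

```python
import re

# One line per HCA port: PCI address, device, ..., "port N (STATE) ==> netdev (Link)".
LINE = re.compile(
    r"^(?P<pci>\S+)\s+(?P<dev>mlx5_\d+)\s.*"
    r"port (?P<port>\d+) \((?P<state>\w+)\s*\) ==> (?P<netdev>\S+) \((?P<link>\w+)\)"
)


def active_fabric_devices(text, exclude_netdevs=("bond-nas",)):
    """Return device names whose IB port is ACTIVE and not mapped to an excluded netdev."""
    devices = []
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if m and m["state"] == "ACTIVE" and m["netdev"] not in exclude_netdevs:
            devices.append(m["dev"])
    return devices
```

Lines that do not match the layout, such as the `Permission denied` messages printed when the command runs without root privileges, are ignored.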

Use the SERVER Side command to check the communication status.

~$ ib_send_bw -d mlx5_0 -i 1 -F

SERVER Side command example

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex.  method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
 remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40                   0.317238
---------------------------------------------------------------------------------------

Example of mutual communication status check result

Use the CLIENT Side command to check the communication status.

~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>

CLIENT Side command example

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex.  method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
 remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37                  0.304102
---------------------------------------------------------------------------------------

Example of mutual communication status check result
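
The result row reports bandwidth in MB/sec, which is easier to compare with link rates once converted to Gb/s. The sketch below assumes that MB here means 2^20 bytes; that reading is consistent with the example itself, where the message rate multiplied by the 65536-byte message size matches the average bandwidth only under that definition. `parse_bw_row` and `mib_per_sec_to_gbps` are hypothetical helper names.

```python
MIB = 1024 * 1024  # assumption: ib_send_bw "MB" is 2**20 bytes


def parse_bw_row(line):
    """Split one result row into its five numeric columns."""
    size, iterations, peak, average, msg_rate = line.split()
    return {
        "bytes": int(size),
        "iterations": int(iterations),
        "peak": float(peak),         # BW peak [MB/sec]
        "avg": float(average),       # BW average [MB/sec]
        "mpps": float(msg_rate),     # MsgRate [Mpps]
    }


def mib_per_sec_to_gbps(value):
    """Convert MB/sec (2**20 bytes) to gigabits per second (10**9 bits)."""
    return value * MIB * 8 / 1e9
```

Under that assumption, the client-side average of 19006.37 MB/sec in the example corresponds to roughly 159.4 Gb/s.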

Check IB service related kernel modules

Use lsmod to check that the kernel modules required for the IB service are loaded, as part of the IaaS hardware-level inspection.

Check IB service-related kernel module - nvidia_peermem

~$ lsmod | grep nvidia_peermem

Example command to check nvidia_peermem

nvidia_peermem         16384  0
ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset

Example of nvidia_peermem check result
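
The same check can be expressed as a small parser over `lsmod` output. This is a minimal sketch, assuming the usual three- or four-column `lsmod` layout; `parse_lsmod` and `missing_modules` are hypothetical names, and the default list of required modules is only an illustration drawn from the example above.

```python
def parse_lsmod(text):
    """Return {module: {"size", "refs", "used_by"}} from lsmod output."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if len(parts) < 3 or not parts[1].isdigit():
            continue
        used_by = parts[3].split(",") if len(parts) > 3 else []
        modules[parts[0]] = {
            "size": int(parts[1]),
            "refs": int(parts[2]),
            "used_by": used_by,
        }
    return modules


def missing_modules(modules, required=("nvidia_peermem", "ib_core", "nvidia")):
    """Return the required module names that are not loaded."""
    return [name for name in required if name not in modules]
```

In the example result, `nvidia_peermem` is loaded and appears in the `used_by` list of both `ib_core` and `nvidia`, so `missing_modules` would return an empty list.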

Check IB service related kernel modules - IB Card (HCA) driver

~$ lsmod|egrep 'ib_|_ib|mlx'

Example command to check the IB Card (HCA) driver

mlx5_ib               393216  0
ib_uverbs             163840  2 irdma,mlx5_ib
ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
mlx5_core            1593344  1 mlx5_ib
mlxfw                  32768  1 mlx5_core
psample                20480  1 mlx5_core
tls                   114688  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core
ib_ipoib              139264  0
ib_cm                 131072  2 rdma_cm,ib_ipoib
ib_umad                40960  0
mlx5_ib               454656  0
ib_uverbs             135168  2 rdma_ucm,mlx5_ib
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
mlx5_core            2064384  1 mlx5_ib
mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

Example of IB Card (HCA) driver verification result

Check storage physical disk resources and Multi-Path

Verify the storage physical disk resources and Multi-Path to assess the IaaS hardware level.

Storage Physical Disk Resource Check Results
root@bm-dev-002:/tmp# fdisk -l
Example of storage physical disk resource check result
Multi-Path verification result
root@bm-dev-002:/tmp# multipath -ll
Example of Multi-Path verification result

Check Service Network after new deployment of Multi-node GPU Cluster

Use the following command to verify that MII Status is up for both the bond itself and each Slave Interface.

Service Network check command
~$ cat /proc/net/bonding/bond-srv
Example command to check Service Network

Service Network check result

Ethernet Channel Bonding Driver: v5.15.0-25-generic
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0

Example of Service Network verification result

Reference

If any Slave Interface is in a down state, report the issue through Contact in the Support Center to receive assistance.
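
The MII Status check on the bond and its slaves can be automated along these lines. This is a sketch under the assumption that the contents of `/proc/net/bonding/bond-srv` have been read into a string; `parse_bonding` and `down_interfaces` are hypothetical helper names.

```python
def parse_bonding(text):
    """Return (bond_mii_status, {slave_name: mii_status}) from a bonding status file."""
    bond_status, slaves, current = None, {}, None
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        if key == "Slave Interface":
            current = value.strip()
        elif key == "MII Status":
            # The first MII Status (before any slave section) belongs to the bond.
            if current is None:
                bond_status = value.strip()
            else:
                slaves[current] = value.strip()
    return bond_status, slaves


def down_interfaces(bond_status, slaves):
    """Return names whose MII Status is not 'up'; the bond itself is listed as 'bond'."""
    down = [name for name, status in slaves.items() if status != "up"]
    if bond_status != "up":
        down.insert(0, "bond")
    return down
```

For the example result, `down_interfaces` would return an empty list, because the bond, `ens9f0`, and `ens11f0` all report `MII Status: up`.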

Multi-node GPU Cluster: Verify time synchronization with the Time Server after new deployment

The OS image ships with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that a source line marked with ^* (the source currently selected for synchronization) appears under the MS column.

Command to check the chrony daemon source and synchronization status
~$ chronyc sources -V
Example of status check command

Result of checking chrony daemon status

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us

Example of checking the chrony daemon status
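
Looking for the `^*` marker can be scripted as follows. This is a minimal sketch, assuming the `chronyc sources` output is available as text; `synced_source` is a hypothetical helper name.

```python
def synced_source(text):
    """Return the name/IP of the source marked '^*', or None if no source is selected."""
    for line in text.splitlines():
        if line.startswith("^*"):
            # Columns: marker, name/IP address, stratum, poll, reach, ...
            return line.split()[1]
    return None
```

With the example output, `synced_source` would return `198.19.0.54`; a return value of `None` means no source is currently selected for synchronization.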
