Multi-node GPU Cluster

1: Overview

1.1: Server Type
1.2: Monitoring Metrics

2: How-to guides

2.1: Cluster Fabric Management
2.2: ServiceWatch Agent Install
2.3: Multi-node GPU Cluster Service Scope and Inspection Guide

3: Release Note

1 - Overview

Service Overview

Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale high-performance AI calculations. It can cluster multiple GPUs using two or more bare metal servers with GPUs, and can be used conveniently with Samsung Cloud Platform’s high-performance storage and networking services.

Provided Features

Multi-node GPU Cluster provides the following functions.

Auto Provisioning and Management: Through the web-based Console, you can easily use the standard GPU Bare Metal model server with 8 GPU cards, from provisioning to resource and cost management.
Network Connection: Two or more Bare Metal Servers can be clustered through high-speed interconnects to process multiple GPUs, and by configuring the GPU Direct RDMA (Remote Direct Memory Access) environment, direct data IO between GPU memories is possible, enabling high-speed AI/Machine Learning calculations.
Storage Connection: It provides various additional connection storages other than OS disks. High-speed network and high-performance SSD NAS File Storage, Block Storage, and Object Storage that are directly linked can also be used in conjunction.
Network Setting Management: The subnet/IP initially set for the server can be easily changed. NAT IP can be used or cancelled as needed.
Monitoring: You can check the monitoring information of computing resources such as CPU, GPU, Memory, Disk, etc. through Cloud Monitoring. To use the Cloud Monitoring service for Multi-node GPU Cluster, you need to install the Agent. Please install the Agent for stable service use. For more information, please refer to Multi-node GPU Cluster Monitoring Metrics.

Component

Multi-node GPU Cluster provides GPU as a Bare Metal Server type with standard images and server types, and NVSwitch and NVLink are provided.

GPU(H100)

A GPU (Graphics Processing Unit) is specialized in parallel computation and can process large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.

Classification	H100 Type
Product Provisioning Method	Bare Metal
GPU Architecture	NVIDIA Hopper
GPU Memory	80GB
GPU Transistors	80 billion 4N TSMC
GPU Tensor Performance(based on FP16)	989.4 TFLOPs, 1,978.9 TFLOPs*
GPU Memory Bandwidth	3,352 GB/sec HBM3
GPU CUDA Cores	16,896 Cores
GPU Tensor Cores	528(4th Generation)
NVLink performance	NVLink 4
Total NVLink bandwidth	900 GB/s
NVLink Signaling Rate	25 Gbps (x18)
NVSwitch performance	NVSwitch 3
NVSwitch GPU bandwidth	900 GB/s
Total NVSwitch Aggregate Bandwidth	7.2TB/s

* With sparsity

Table. GPU Type Specifications
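
The per-GPU figures in the table can be scaled up to a whole node. The following is a minimal sketch, assuming the standard 8-GPU server model mentioned under Provided Features; the constant and function names are illustrative and are not part of any Samsung Cloud Platform API. The results appear consistent with the 7.2 TB/s aggregate NVSwitch bandwidth listed above and with the 640 GB GPU memory listed for the g2c96h8_metal server type.

```python
# Per-GPU figures taken from the specification table above.
GPU_MEMORY_GB = 80
NVSWITCH_GPU_BANDWIDTH_GB_PER_S = 900
GPUS_PER_NODE = 8  # assumption: the standard model server with 8 GPU cards


def node_totals(gpus: int = GPUS_PER_NODE) -> dict:
    """Return aggregate GPU memory and NVSwitch bandwidth for one node."""
    return {
        "gpu_memory_gb": gpus * GPU_MEMORY_GB,
        "nvswitch_aggregate_tb_per_s": gpus * NVSWITCH_GPU_BANDWIDTH_GB_PER_S / 1000,
    }


totals = node_totals()
print(totals)
```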

OS and GPU Driver Version

The operating systems (OS) supported by Multi-node GPU Cluster are as follows.

OS	OS version	GPU driver version
Ubuntu	22.04	535.86.10, 535.183.06

Table. Multi-node GPU Cluster OS and GPU Driver Version

Server Type

The server types provided by Multi-node GPU Cluster are as follows. For a detailed description of the server types provided by Multi-node GPU Cluster, please refer to Multi-node GPU Cluster server type.

g2c96h8_metal

Classification	Example	Detailed Description
Server Generation	g2	Provided server generation g2: g means GPU server, and 2 means generation
CPU	c96	Number of Cores c96: Assigned Core is a physical core
GPU	h8	GPU type and quantity h8: h means GPU type, and 8 means GPU quantity

Table. Multi-node GPU Cluster server type format

Preceding Service

This is a list of services that must be pre-configured before creating this service. Please refer to the guide provided for each service and prepare in advance for more details.

Service Category	Service	Detailed Description
Networking	VPC	A service that provides an independent virtual network in a cloud environment

Table. Multi-node GPU Cluster Preceding Service

1.1 - Server Type

Multi-node GPU Cluster Server Type

Multi-node GPU Cluster is divided based on the provided GPU Type, and the GPU used in the Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Please select the server type according to the specifications of the application you want to run in the Multi-node GPU Cluster.

The server types supported by Multi-node GPU Cluster are in the following format:

g2c96h8_metal

Classification	Example	Detailed Description
Server Generation	g2	Provided server generation g2: g means GPU server, and 2 means generation
CPU	c96	Number of cores c96: Assigned cores are physical cores
GPU	h8	GPU type and quantity h8: h means GPU type, and 8 means GPU quantity

Table. Multi-node GPU Cluster server type format
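
The naming format above can be split mechanically into its parts. The following is a minimal sketch based only on the format table; the `ServerType` class and `parse_server_type` function are hypothetical helpers, not part of any Samsung Cloud Platform SDK, and the pattern assumes a single-letter GPU type and the `_metal` suffix shown in the example.

```python
import re
from dataclasses import dataclass

# g<generation>c<cores><gpu type letter><gpu count>_metal, e.g. g2c96h8_metal
_SERVER_TYPE_RE = re.compile(
    r"g(?P<generation>\d+)c(?P<cores>\d+)(?P<gpu_type>[a-z])(?P<gpu_count>\d+)_metal"
)


@dataclass(frozen=True)
class ServerType:
    generation: int  # server generation, e.g. 2
    cores: int       # number of physical cores, e.g. 96
    gpu_type: str    # GPU type letter, e.g. "h"
    gpu_count: int   # number of GPUs, e.g. 8


def parse_server_type(name: str) -> ServerType:
    """Split a server type name into generation, cores, GPU type, and GPU count."""
    match = _SERVER_TYPE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized server type name: {name!r}")
    return ServerType(
        generation=int(match["generation"]),
        cores=int(match["cores"]),
        gpu_type=match["gpu_type"],
        gpu_count=int(match["gpu_count"]),
    )


parsed = parse_server_type("g2c96h8_metal")
print(parsed)
```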

g2 Server Type

The g2 server type is a GPU Bare Metal Server using NVIDIA H100 Tensor Core GPU, suitable for large-scale high-performance AI computing.

Provides up to 8 NVIDIA H100 Tensor Core GPUs
Each GPU has 16,896 CUDA cores and 528 Tensor cores
Supports up to 96 vCPUs and 1,920 GB of memory
Supports up to 100 Gbps networking speed
900 GB/s GPU P2P communication through NVIDIA NVSwitch

Server Type	GPU	GPU Memory	CPU(Core)	Memory	Disk	GPU P2P
g2c96h8_metal	H100	640 GB	96 vCore	2 TB	SSD(OS) 960 GB * 2, NVMeSSD 3.84 TB * 4	900GB/s NVSwitch

Table. Multi-node GPU Cluster server type specification > H100 server type

1.2 - Monitoring Metrics

Multi-node GPU Cluster monitoring metrics

The following table shows the monitoring metrics of Multi-node GPU Cluster that can be checked through Cloud Monitoring.

Guide

To view monitoring metrics for Multi-node GPU Cluster, the user must install the Agent by following the guide. Please install the Agent for stable service use. For the Agent installation method and detailed Cloud Monitoring usage, please refer to the Cloud Monitoring guide.

Multi-node GPU Cluster [Cluster]

Performance Item	Detailed Description	Unit
Memory Total [Basic]	Available memory bytes	bytes
Memory Used [Basic]	Currently used memory bytes	bytes
Memory Swap In [Basic]	Swapped-in memory bytes	bytes
Memory Swap Out [Basic]	Swapped out memory bytes	bytes
Memory Free [Basic]	Unused memory bytes	bytes
Disk Read Bytes [Basic]	Read bytes	bytes
Disk Read Requests [Basic]	Number of Read Requests	cnt
Disk Write Bytes [Basic]	Write bytes	bytes
Disk Write Requests [Basic]	Write Request Count	cnt
CPU Usage [Basic]	1-minute average system CPU usage rate	%
Instance State [Basic]	Instance Status	state
Network In Bytes [Basic]	Received bytes	bytes
Network In Dropped [Basic]	Receive Packet Drop	cnt
Network In Packets [Basic]	Received Packet Count	cnt
Network Out Bytes [Basic]	Transmission bytes	bytes
Network Out Dropped [Basic]	Transmission Packet Drop	cnt
Network Out Packets [Basic]	Transmission Packet Count	cnt

Table. Multi-node GPU Cluster [Cluster] Monitoring Metrics (Default Provided)

Performance Item	Detailed Description	Unit
Cluster GPU Count	Sum of the GPU counts in the cluster: the sum of the GPU Count of each node in the same GPU Cluster	cnt
Cluster GPU Count In Use	Number of GPUs in the cluster that are running jobs: the sum of the number of GPUs held by processes, obtained by parsing the ‘Processes:’ information at the bottom of the nvidia-smi result of each node in the same GPU Cluster	cnt
Cluster GPU Usage	Average GPU utilization in the cluster: the average of the GPU utilization values of each node in the same GPU Cluster	%
Cluster GPU Memory Usage [Avg]	Average GPU memory utilization in the cluster: the average of the memory utilization values of each node in the same GPU Cluster	%

Table. Multi-node GPU Cluster [Cluster] additional monitoring metrics (Agent installation required)
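
The cluster-level metrics above are described as sums and averages of per-node values. The following is a minimal sketch of that aggregation, assuming per-node samples have already been collected; the `NodeSample` class, the `cluster_metrics` function, and the sample values are hypothetical and do not reflect how the Agent is actually implemented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    gpu_count: int               # GPUs on the node
    gpu_count_in_use: int        # GPUs held by at least one process
    gpu_usage_pct: float         # node GPU utilization (%)
    gpu_memory_usage_pct: float  # node GPU memory utilization (%)


def cluster_metrics(nodes: list[NodeSample]) -> dict:
    """Aggregate per-node samples from the same GPU Cluster."""
    return {
        "Cluster GPU Count": sum(n.gpu_count for n in nodes),
        "Cluster GPU Count In Use": sum(n.gpu_count_in_use for n in nodes),
        "Cluster GPU Usage": mean(n.gpu_usage_pct for n in nodes),
        "Cluster GPU Memory Usage [Avg]": mean(n.gpu_memory_usage_pct for n in nodes),
    }


# Hypothetical samples for a two-node cluster.
samples = [
    NodeSample(gpu_count=8, gpu_count_in_use=8, gpu_usage_pct=90.0, gpu_memory_usage_pct=70.0),
    NodeSample(gpu_count=8, gpu_count_in_use=4, gpu_usage_pct=50.0, gpu_memory_usage_pct=30.0),
]
result = cluster_metrics(samples)
print(result)
```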

Multi-node GPU Cluster [Node]

Performance Item	Detailed Description	Unit
Memory Total [Basic]	Available memory bytes	bytes
Memory Used [Basic]	Currently used memory bytes	bytes
Memory Swap In [Basic]	Swapped-in memory bytes	bytes
Memory Swap Out [Basic]	Swapped out memory bytes	bytes
Memory Free [Basic]	Unused memory bytes	bytes
Disk Read Bytes [Basic]	Read bytes	bytes
Disk Read Requests [Basic]	Number of Read Requests	cnt
Disk Write Bytes [Basic]	Write bytes	bytes
Disk Write Requests [Basic]	Write Request Count	cnt
CPU Usage [Basic]	1-minute average system CPU usage rate	%
Instance State [Basic]	Instance Status	state
Network In Bytes [Basic]	Received bytes	bytes
Network In Dropped [Basic]	Received Packet Drop	cnt
Network In Packets [Basic]	Received Packet Count	cnt
Network Out Bytes [Basic]	Transmission bytes	bytes
Network Out Dropped [Basic]	Transmission Packet Drop	cnt
Network Out Packets [Basic]	Transmission packet count	cnt

Table. Multi-node GPU Cluster [Node] Monitoring Metrics (Default Provided)

Performance Item	Detailed Description	Unit
GPU Count	GPU count	cnt
GPU Temperature	GPU temperature	℃
GPU Usage	GPU utilization	%
GPU Usage [Avg]	Overall average GPU usage rate (%)	%
GPU Power Cap	GPU’s maximum power capacity	W
GPU Power Usage	GPU’s current power usage	W
GPU Memory Usage [Avg]	Average GPU memory utilization	%
GPU Count in use	Node’s Job-running GPU count	cnt
Execution Status for nvidia-smi	nvidia-smi command execution result	status
Core Usage [IO Wait]	The ratio of CPU time spent in waiting state (disk waiting)	%
Core Usage [System]	The ratio of CPU time spent in kernel space	%
Core Usage [User]	The ratio of CPU time spent in user space	%
CPU Cores	The number of CPU cores on the host. Non-normalized percentages have a maximum value of 100% * the number of cores. Normalized percentages already take this value into account and have a maximum value of 100%.	cnt
CPU Usage [Active]	Percentage of CPU time used excluding Idle and IOWait states (if all 4 cores are used at 100%: 400%)	%
CPU Usage [Idle]	The ratio of CPU time spent in idle state.	%
CPU Usage [IO Wait]	The percentage of CPU time spent in waiting state (disk waiting)	%
CPU Usage [System]	Percentage of CPU time used by the kernel (in case of using all 4 cores 100%: 400%)	%
CPU Usage [User]	Percentage of CPU time used in the user area. (In case of using all 4 cores 100%, 400%)	%
CPU Usage/Core [Active]	Percentage of CPU time used excluding Idle and IOWait states (normalized value by number of cores, 100% if all 4 cores are used at 100%)	%
CPU Usage/Core [Idle]	The ratio of CPU time spent in idle state.	%
CPU Usage/Core [IO Wait]	The ratio of CPU time spent in waiting state (disk waiting)	%
CPU Usage/Core [System]	Percentage of CPU time used by the kernel (normalized value by number of cores, 100% if all 4 cores are used at 100%)	%
CPU Usage/Core [User]	Percentage of CPU time used in the user area. (normalized value by number of cores, 100% if all 4 cores are used at 100%)	%
Disk CPU Usage [IO Request]	The ratio of CPU time spent executing input/output requests for the device (device bandwidth utilization). If this value is close to 100%, the device is in a saturated state.	%
Disk Queue Size [Avg]	The average queue length of requests executed for the device.	num
Disk Read Bytes	The number of bytes read from the device per second.	bytes
Disk Read Bytes [Delta Avg]	Average of system.diskio.read.bytes_delta for each disk	bytes
Disk Read Bytes [Delta Max]	Individual disks’ system.diskio.read.bytes_delta maximum	bytes
Disk Read Bytes [Delta Min]	Individual disks’ minimum system.diskio.read.bytes_delta	bytes
Disk Read Bytes [Delta Sum]	The sum of system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta]	Delta value of system.diskio.read.bytes for each disk	bytes
Disk Read Bytes [Success]	The total number of bytes read successfully. On Linux, this is the number of sectors read multiplied by an assumed sector size of 512 bytes	bytes
Disk Read Requests	The number of read requests for the disk device in 1 second	cnt
Disk Read Requests [Delta Avg]	Average of system.diskio.read.count_delta for each disk	cnt
Disk Read Requests [Delta Max]	Maximum of system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Min]	Minimum of system.diskio.read.count_delta for each disk	cnt
Disk Read Requests [Delta Sum]	Sum of system.diskio.read.count_delta of individual disks	cnt
Disk Read Requests [Success Delta]	Individual disk’s system.diskio.read.count delta	cnt
Disk Read Requests [Success]	Total number of successful read completions	cnt
Disk Request Size [Avg]	The average size of requests executed for the device (unit: sector)	num
Disk Service Time [Avg]	The average service time (in milliseconds) for input requests executed on the device.	ms
Disk Wait Time [Avg]	The average time taken for requests issued to the device to be served.	ms
Disk Wait Time [Read]	Average wait time for disk read requests	ms
Disk Wait Time [Write]	Average wait time for disk write requests	ms
Disk Write Bytes [Delta Avg]	Average of system.diskio.write.bytes_delta for each disk	bytes
Disk Write Bytes [Delta Max]	Maximum of system.diskio.write.bytes_delta for each disk	bytes
Disk Write Bytes [Delta Min]	Individual disks’ minimum system.diskio.write.bytes_delta	bytes
Disk Write Bytes [Delta Sum]	The sum of system.diskio.write.bytes_delta of individual disks	bytes
Disk Write Bytes [Delta]	Delta value of system.diskio.write.bytes for each disk	bytes
Disk Write Bytes [Success]	The total number of bytes written successfully. On Linux, this is the number of sectors written multiplied by an assumed sector size of 512 bytes	bytes
Disk Write Requests	The number of write requests to the disk device for 1 second	cnt
Disk Write Requests [Delta Avg]	Average of system.diskio.write.count_delta of individual disks	cnt
Disk Write Requests [Delta Max]	Maximum of system.diskio.write.count_delta for each disk	cnt
Disk Write Requests [Delta Min]	Minimum of system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Sum]	Sum of system.diskio.write.count_delta of individual disks	cnt
Disk Write Requests [Success Delta]	Individual disk’s system.diskio.write.count delta	cnt
Disk Write Requests [Success]	Total number of writes completed successfully	cnt
Disk Writes Bytes	The number of bytes written to the device per second.	bytes
Filesystem Hang Check	filesystem(local/NFS) hang check (normal:1, abnormal:0)	status
Filesystem Nodes	The total number of file nodes in the file system.	cnt
Filesystem Nodes [Free]	The total number of available file nodes in the file system.	cnt
Filesystem Size [Available]	Disk space (in bytes) available to unprivileged users.	bytes
Filesystem Size [Free]	Available disk space (bytes)	bytes
Filesystem Size [Total]	Total Disk Space (bytes)	bytes
Filesystem Usage	Used Disk Space Percentage	%
Filesystem Usage [Avg]	Average of individual filesystem.used.pct	%
Filesystem Usage [Inode]	Inode usage rate	%
Filesystem Usage [Max]	Maximum value among individual filesystem usage percentages	%
Filesystem Usage [Min]	Minimum of individual filesystem used percentages	%
Filesystem Usage [Total]	-	%
Filesystem Used	Used Disk Space (bytes)	bytes
Filesystem Used [Inode]	Inode usage	bytes
Memory Free	The total amount of available memory (bytes). It does not include memory used by system cache and buffers (see system.memory.actual.free).	bytes
Memory Free [Actual]	Actual available memory (bytes). The calculation method varies depending on the OS, and in Linux, it is either MemAvailable from /proc/meminfo or calculated from available memory, cache, and buffer if meminfo is not available. On OSX, it is the sum of available memory and inactive memory. On Windows, it is the same as system.memory.free.	bytes
Memory Free [Swap]	Available swap memory.	bytes
Memory Total	Total Memory	bytes
Memory Total [Swap]	Total swap memory.	bytes
Memory Usage	Used memory percentage: ((Memory Total - Memory Free) / Memory Total) * 100. Memory Free: the capacity of free memory currently available.	%
Memory Usage [Actual]	The percentage of memory actually used: ((Memory Total - Memory Available) / Memory Total) * 100 or ((Memory Total - (Memory Free + Buffers + Cached)) / Memory Total) * 100. Memory Free: the capacity of free memory currently available. Buffers: the capacity of memory used by buffers. Cached: the capacity of memory used by page cache.	%
Memory Usage [Cache Swap]	Cache swap usage rate	%
Memory Usage [Swap]	Used swap memory percentage	%
Memory Used	Used Memory	bytes
Memory Used [Actual]	Actual used memory (bytes): the total memory minus the available memory. The available memory is calculated differently depending on the OS (refer to system.actual.free)	bytes
Memory Used [Swap]	Used swap memory.	bytes
Collisions	Network Collisions	cnt
Network In Bytes	Received byte count	bytes
Network In Bytes [Delta Avg]	Average of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Max]	Maximum of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Min]	Minimum of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Sum]	Sum of each network’s system.network.in.bytes_delta	bytes
Network In Bytes [Delta]	Received byte count delta	bytes
Network In Dropped	The number of packets dropped among incoming packets	cnt
Network In Errors	Number of errors during reception	cnt
Network In Packets	Received packet count	cnt
Network In Packets [Delta Avg]	Average of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Max]	Individual networks’ system.network.in.packets_delta maximum	cnt
Network In Packets [Delta Min]	Minimum of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Sum]	Sum of system.network.in.packets_delta of individual networks	cnt
Network In Packets [Delta]	Received packet count delta	cnt
Network Out Bytes	Transmitted byte count	bytes
Network Out Bytes [Delta Avg]	Average of system.network.out.bytes_delta for each network	bytes
Network Out Bytes [Delta Max]	Individual networks’ system.network.out.bytes_delta maximum	bytes
Network Out Bytes [Delta Min]	Minimum of system.network.out.bytes_delta for each network	bytes
Network Out Bytes [Delta Sum]	The sum of system.network.out.bytes_delta of individual networks	bytes
Network Out Bytes [Delta]	Transmitted byte count delta	bytes
Network Out Dropped	Number of packets dropped among outgoing packets. This value is not reported by the operating system, so it is always 0 in Darwin and BSD	cnt
Network Out Errors	Number of errors during transmission	cnt
Network Out Packets	Number of transmitted packets	cnt
Network Out Packets [Delta Avg]	Average of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Max]	Maximum of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Min]	Individual networks’ minimum system.network.out.packets_delta	cnt
Network Out Packets [Delta Sum]	Sum of system.network.out.packets_delta of individual networks	cnt
Network Out Packets [Delta]	Transmitted packet count delta	cnt
Open Connections [TCP]	All open TCP connections	cnt
Open Connections [UDP]	All open UDP connections	cnt
Port Usage	Usage rate of the ports available for connections	%
SYN Sent Sockets	Number of sockets in SYN_SENT state (when connecting from local to remote)	cnt
Kernel PID Max	kernel.pid_max value	cnt
Kernel Thread Max	kernel.threads-max value	cnt
Process CPU Usage	Percentage of CPU time consumed by the process since the last update. This value is similar to the %CPU value of the process displayed by the top command on Unix systems	%
Process CPU Usage/Core	Percentage of CPU time used by the process since the last event, normalized by the number of cores, with a value between 0 and 100%	%
Process Memory Usage	Percentage of main memory (RAM) occupied by the process	%
Process Memory Used	Resident Set size. The amount of memory a process occupies in RAM. In Windows, it is the current working set size	bytes
Process PID	Process PID	PID
Process PPID	Parent process’s pid	PID
Processes [Dead]	Number of dead processes	cnt
Processes [Idle]	Number of idle processes	cnt
Processes [Running]	Number of running processes	cnt
Processes [Sleeping]	Number of sleeping processes	cnt
Processes [Stopped]	Number of stopped processes	cnt
Processes [Total]	Total number of processes	cnt
Processes [Unknown]	Number of processes whose state cannot be retrieved or is unknown	cnt
Processes [Zombie]	Number of zombie processes	cnt
Running Process Usage	process usage rate	%
Running Processes	Number of running processes	cnt
Running Thread Usage	Thread usage rate	%
Running Threads	Number of threads running in running processes	cnt
Instance Status	Instance status	state
Context Switches	context switch count (per second)	cnt
Load/Core [1 min]	Load for the last 1 minute divided by the number of cores	cnt
Load/Core [15 min]	The value of load divided by the number of cores for the last 15 minutes	cnt
Load/Core [5 min]	The value of load divided by the number of cores over the last 5 minutes	cnt
Multipaths [Active]	External storage connection path status = active count	cnt
Multipaths [Failed]	External storage connection path status = failed count	cnt
Multipaths [Faulty]	External storage connection path status = faulty count	cnt
NTP Offset	Measured offset of the last sample (time difference between NTP server and local environment)	num
Run Queue Length	Length of the queue of processes waiting to run	num
Uptime	OS operation time (uptime). (milliseconds)	ms
Context Switches	CPU context switch count (per second)	cnt
Disk Read Bytes [Sec]	Number of bytes read from the Windows logical disk in 1 second	cnt
Disk Read Time [Avg]	Data Read Average Time (sec)	sec
Disk Transfer Time [Avg]	Disk average wait time	sec
Disk Usage	Disk Usage Rate	%
Disk Write Bytes [Sec]	Number of bytes written to the Windows logical disk in 1 second	cnt
Disk Write Time [Avg]	Data Write Average Time (sec)	sec
Pagingfile Usage	Paging file usage rate	%
Pool Used [Non Paged]	Kernel memory Non-paged pool usage	bytes
Pool Used [Paged]	Paged Pool usage among kernel memory	bytes
Process [Running]	The number of processes currently running	cnt
Threads [Running]	Number of threads currently running	cnt
Threads [Waiting]	The number of threads waiting for processor time	cnt

Table. Multi-node GPU Cluster [Node] additional monitoring metrics (Agent installation required)
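
Several rows in the table above describe derived values: the CPU Usage/Core metrics are the CPU Usage metrics normalized by the number of cores, and the Memory Usage metrics are percentages computed from byte counts. The following is a minimal sketch of those formulas as stated in the table; the function names are illustrative and are not taken from the Agent.

```python
def normalize_cpu_pct(non_normalized_pct: float, cores: int) -> float:
    """Convert a CPU Usage value (max 100% * cores) to a CPU Usage/Core value (max 100%)."""
    return non_normalized_pct / cores


def memory_usage_pct(total: int, free: int) -> float:
    """Memory Usage: ((Memory Total - Memory Free) / Memory Total) * 100."""
    return (total - free) / total * 100


def memory_usage_actual_pct(total: int, free: int, buffers: int, cached: int) -> float:
    """Memory Usage [Actual], using the Free + Buffers + Cached form of the formula."""
    return (total - (free + buffers + cached)) / total * 100


# All 4 cores used at 100%: 400% non-normalized, 100% normalized.
print(normalize_cpu_pct(400.0, 4))
print(memory_usage_pct(total=100, free=25))
print(memory_usage_actual_pct(total=100, free=25, buffers=10, cached=15))
```

The gap between the last two results illustrates why Memory Usage [Actual] is usually lower than Memory Usage: memory held by buffers and page cache is counted as used in the first formula but as reclaimable in the second.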

2 - How-to guides

The user can enter the required information for the Multi-node GPU Cluster service through the Samsung Cloud Platform Console, select detailed options, and create the service.

Multi-node GPU Cluster Getting Started

You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.

This service consists of GPU Node and Cluster Fabric services.

GPU Node Creation

To create a Multi-node GPU Cluster, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node creation button on the Service Home page. You will be taken to the GPU Node creation page.
On the GPU Node creation page, enter the information required to create the service and select detailed options.
- In the Image and Version Selection area, select the required information.

Category	Required	Detailed description
Image	Required	Select the provided image type: Ubuntu
Image Version	Required	Select the version of the chosen image from the list of versions of the provided server images

Table. GPU Node image and version selection items

In the Enter Service Information area, enter or select the required information.

Category	Required	Detailed description
Number of servers	Required	Number of GPU Node servers to create simultaneously. Only numbers can be entered, and the minimum number of servers to create is 2. Two or more servers can be created only during the initial setup; after that, expansion is possible one at a time.
Service Type > Server Type	Required	GPU Node Server Type. Select the desired CPU, Memory, GPU, and Disk specifications. For detailed information about the server types provided by GPU Node, refer to Multi-node GPU Cluster Server Type.
Service Type > Planned Compute	Required	Status of resources with Planned Compute set. In Use: number of resources with Planned Compute that are currently in use. Configured: number of resources with Planned Compute set. Coverage Preview: amount applied per resource by Planned Compute. Planned Compute Service Application: go to the Planned Compute service application page. For details, refer to Apply for Planned Compute.

Table. GPU Node Service Information Input Items

In the Required Information Input area, enter or select the required information.

Category	Required or not	Detailed description
Administrator Account	Required	Set the administrator account and password to be used when connecting to the server. For Ubuntu OS, the account is fixed as root.
Server Name Prefix	Required	Enter a Prefix to distinguish each GPU Node generated when the number of selected servers is 2 or more. Server names are automatically generated in the format user input value (prefix) + ‘`-###`’. Start with a lowercase English letter, and use lowercase letters, numbers, and the special character (`-`) within 3 to 11 characters. Must not end with the special character (`-`).
Network Settings	Required	Set the network where the GPU Node will be installed. VPC Name: Select a pre-created VPC. General Subnet Name: Select a pre-created general Subnet; the IP can be set to auto-generate or user input, and if input is selected, the user enters the IP directly. NAT: Can be used only when there is 1 server and the VPC has an Internet Gateway attached. Checking ‘use’ allows selection of a NAT IP. (Because the initial creation requires 2 or more servers, modify this on the resource detail page.) NAT IP: Select a NAT IP. If there is no NAT IP to select, click the Create New button to generate a Public IP, then click the Refresh button to view and select the created Public IP. Creating a Public IP incurs charges according to the Public IP pricing policy.

Table. GPU Node required information entry items
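
The Server Name Prefix rules above can be checked before submitting the form. The following is a minimal sketch of those rules; `is_valid_prefix` and `server_names` are hypothetical helpers, and numbering the generated names from 001 is an assumption, since the table only states the ‘-###’ format.

```python
import re

# Starts with a lowercase letter, contains only lowercase letters, digits,
# and '-', is 3 to 11 characters long, and does not end with '-'.
_PREFIX_RE = re.compile(r"[a-z][a-z0-9-]{1,9}[a-z0-9]")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the Server Name Prefix rules."""
    return _PREFIX_RE.fullmatch(prefix) is not None


def server_names(prefix: str, count: int, start: int = 1) -> list[str]:
    """Build names in the prefix + '-###' format (start index is an assumption)."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid server name prefix: {prefix!r}")
    return [f"{prefix}-{i:03d}" for i in range(start, start + count)]


names = server_names("gpu", 2)
print(names)
```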

In the Cluster selection area, create or select a Cluster Fabric.

Category	Required	Detailed description
Cluster Fabric	Required	A group of GPU Node servers to which GPU Direct RDMA can be applied together. Optimal GPU performance and speed can be secured only within the same Cluster Fabric. To create a new Cluster Fabric, select New Input > Node pool, then enter the name of the Cluster Fabric to be created. To add to an existing Cluster Fabric, select Existing Input > Node pool, then select the already created Cluster Fabric.

Table. GPU Node Cluster Fabric selection items

In the Additional Information Input area, enter or select the required information.

Category	Required or not	Detailed description
Lock	Optional	Using Lock prevents accidental actions that could terminate/start/stop the server.
Init Script	Optional	Script to run when the server starts. The Init Script must be selected according to the image type. For Linux: select Shell Script or cloud-init.
Tag	Optional	Add Tag. Up to 50 tags can be added per resource. After clicking the Add Tag button, enter or select the Key and Value.

Table. GPU Node additional information input items

In the Summary panel, check the detailed information and estimated billing amount, and click the Complete button.
- Once creation is complete, check the created resources on the GPU Node List page.

Caution

When a service is created, the GPU MIG/ECC settings are reset. To make sure the settings are applied correctly, reboot the server once after creation and verify that the settings have been applied before using it.
For detailed information about GPU MIG/ECC settings reset, please refer to GPU MIG/ECC Settings Reset Checklist Guide.
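
One way to verify the MIG/ECC state after the reboot is to compare the current and pending modes reported by nvidia-smi. The following is a minimal sketch, not the procedure from the checklist guide: it assumes the `nvidia-smi --query-gpu` fields `mig.mode.current`, `mig.mode.pending`, `ecc.mode.current`, and `ecc.mode.pending` are available with the installed driver, and the helper names and sample text are hypothetical. A GPU whose pending mode differs from its current mode still has a change waiting to be applied.

```python
import csv
import io
import subprocess

QUERY_FIELDS = [
    "index",
    "mig.mode.current",
    "mig.mode.pending",
    "ecc.mode.current",
    "ecc.mode.pending",
]


def parse_gpu_modes(csv_text: str) -> list[dict]:
    """Parse 'csv,noheader' output into one dict per GPU, keyed by query field."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def gpus_with_pending_changes(rows: list[dict]) -> list[str]:
    """Return the indices of GPUs whose pending MIG or ECC mode differs from the current one."""
    return [
        row["index"]
        for row in rows
        if row["mig.mode.current"] != row["mig.mode.pending"]
        or row["ecc.mode.current"] != row["ecc.mode.pending"]
    ]


def query_gpu_modes() -> list[dict]:
    """Run nvidia-smi on the node and parse its output (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_modes(result.stdout)


# Hypothetical output for two GPUs; GPU 1 has a MIG change that is not yet applied.
sample = "0, Disabled, Disabled, Enabled, Enabled\n1, Disabled, Enabled, Enabled, Enabled\n"
pending = gpus_with_pending_changes(parse_gpu_modes(sample))
print(pending)
```

On a GPU Node, calling `gpus_with_pending_changes(query_gpu_modes())` would perform the same check against the real driver output.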

Check GPU Node Detailed Information

The Multi-node GPU Cluster service allows you to view and edit the full list of GPU Node resources and detailed information.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.

Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.

Resource items other than required columns can be added via the Settings button.

Category	Required or not	Detailed description
Resource ID	Optional	User-created GPU Node ID
Cluster Fabric name	Required	Cluster Fabric name created by the user
Server name	Required	User-created GPU Node name
Server Type	Required	Server type of the GPU Node. The user can check the number of cores, memory capacity, and GPU type and count of the created resource.
Image	Required	Image version of the GPU Node created by the user
IP	Required	IP of the GPU Node created by the user
Status	Required	Status of the GPU Node created by the user
Creation Time	Optional	GPU Node creation time

Table. GPU Node Resource List Items

On the GPU Node List page, click the resource whose details you want to view. You will be taken to the GPU Node Details page.

At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.

Category	Detailed description
GPU Node status	Status of the GPU Node created by the user. Creating: the server is being created. Running: creation is complete and the server is usable. Editing: the IP is being changed. Unknown: error state. Starting: the server is starting. Stopping: the server is stopping. Stopped: the server has stopped. Terminating: termination is in progress. Terminated: termination is complete.
Server Control	Buttons to change the server status. Start: start a stopped server. Stop: stop a running server.
Service cancellation	Button to cancel the service

Table. GPU Node status information and additional features

Detailed Information

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit it if necessary.

Category	Detailed description
Service	Service Name
Resource Type	Resource Type
SRN	Unique resource ID in Samsung Cloud Platform. For a GPU Node, this is the GPU Node SRN.
Resource Name	Resource name. In the GPU Node service, this is the GPU Node name.
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation time	Service creation time
Modifier	User who edited the service information
Modification date/time	Date and time when the service information was modified
Server name	Server name
Node pool	A group of nodes that can be grouped together with the same Cluster Fabric
Cluster Fabric name	Cluster Fabric name created by the user
Image/Version	Server’s OS image and version
Server Type	Displays CPU, memory, and GPU information
Planned Compute	Status of resources with Planned Compute set. For more details, refer to Apply for Planned Compute.
Lock	Displays whether Lock is in use. When Lock is in use, server termination, start, and stop are blocked to prevent accidental actions. To change the Lock setting, click the Edit button.
Network	GPU Node network information: VPC name, general Subnet name, IP, IP status, NAT IP, NAT IP status
Block Storage	Information on the Block Storage connected to the server: volume name, disk type, capacity, status
Init Script	View the Init Script content entered when creating the server

Table. GPU Node detailed information tab items

Category	Detailed description
Tag List	List of tags. You can check each tag's Key and Value. Up to 50 tags can be added per resource. When entering a tag, search and select from the existing list of Keys and Values.

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category	Detailed description
Job History List	Resource change history. You can check the job details, job date and time, resource type, resource name, event topic, job result, and operator information. The Detailed Search button provides a detailed search function.

Table. GPU Node Job History Tab Detailed Information Items

GPU Node Operation Control

If you need server control and management functions for the created GPU Node resources, you can perform them on the GPU Node List page or the GPU Node Details page. You can start a stopped GPU Node or stop a running one.

Start GPU Node

You can start a stopped GPU Node. To start the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can also select one or more servers with the checkboxes and start them via the More button at the top.
On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Start button at the top to start the server.
Check the server status to confirm that the status change is complete.

Stop GPU Node

You can stop a GPU Node that is active. To stop the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can also select one or more servers with the checkboxes and stop them using the Stop button at the top.
On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Stop button at the top to stop the server.
Check the server status to confirm that the status change is complete.

Cancel GPU Node

You can cancel unused GPU Nodes to reduce operating costs. However, canceling the service may immediately stop services that are running, so consider the impact of the interruption carefully before proceeding.

Caution

Please note that data cannot be recovered after service termination.

To cancel the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Server menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
On the Cluster Fabric List page, select the resource to cancel and click the Cancel Service button.
- Resources using the same Cluster Fabric can be terminated simultaneously.
Once the termination is complete, check on the GPU Node List page whether the resources have been terminated.

Guide

The cases where GPU Node termination is not possible are as follows.

When Block Storage (BM) is connected: Please disconnect the Block Storage (BM) connection first.
If File Storage is connected: Please disconnect the File Storage first.
When Lock is set: Please disable the Lock setting and try again.
If a server that cannot be terminated simultaneously is included: Please re-select only the resources that can be terminated.
If the Cluster Fabric of the server you want to terminate is different: Select only resources that use the same Cluster Fabric.

Reference

If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.

2.1 - Cluster Fabric Management

Cluster Fabric is a service that helps manage servers (GPU Nodes) included in a GPU Cluster. Using Cluster Fabric, you can move servers between GPU Clusters in the same Node pool and optimize the performance and speed of GPUs within the same GPU Cluster.

Creating Cluster Fabric

Cluster Fabric can be created together with a GPU Node, and it cannot be created or deleted separately. When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. If you haven’t created a GPU Node, please create one first. For more information, refer to Creating a GPU Node.

Checking Cluster Fabric Details

You can check the created Cluster Fabric list and details, and move servers on the Cluster Fabric List page and Cluster Fabric Details page.

Click the All Services > Compute > Multi-node GPU Server menu. You will be taken to the Service Home page of Multi-node GPU Cluster.

Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.

On the Cluster Fabric List page, you can view the list of resources of the GPU Cluster created by the user.

Resource items other than required columns can be added through the Settings button.

Category	Required	Description
Resource ID	Optional	Cluster Fabric ID created by the user
Cluster Fabric Name	Required	Cluster Fabric name created by the user
Node Pool	Optional	A collection of nodes that can be bundled into the same Cluster Fabric
Number of Servers	Optional	Number of GPU Nodes
Server Type	Optional	Server type of the GPU Node. The user can check the number of cores, memory capacity, and GPU type and count of the created resource.
Status	Optional	Status of the Cluster Fabric created by the user
Creation Time	Optional	Time when the Cluster Fabric was created

Table. Cluster Fabric resource list items

On the Cluster Fabric List page, click the resource whose details you want to check. You will be taken to the Cluster Fabric Details page.

At the top of the Cluster Fabric Details page, status information and additional feature descriptions are displayed.

Category	Description
Cluster Fabric Status	Status of the Cluster Fabric created by the user. Creating: cluster creation is in progress. Active: creation is complete and the cluster is available. Editing: an IP change is in progress. Deleting: termination is in progress. Deleted: termination is complete.
Add Target Server	Function to move a server from another cluster to this cluster

Table. Cluster Fabric status information and additional features

Details

On the Details tab of the Cluster Fabric Details page, you can check the details of the selected resource and bring in servers from other clusters.

Category	Description
Service	Service category
Resource Type	Service name
SRN	Unique resource ID in Samsung Cloud Platform. For Cluster Fabric, this is the Cluster Fabric SRN.
Resource Name	Resource name. In the Cluster Fabric service, this is the Cluster Fabric name.
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation Time	Time when the service was created
Modifier	User who modified the service information
Modification Time	Time when the service information was modified
Cluster Fabric Name	Cluster Fabric name created by the user
Node Pool	A collection of nodes that can be bundled into the same Cluster Fabric
Target Server	List of GPU Nodes bound to the Cluster Fabric: server name, server type, IP, status

Table. Cluster Fabric details tab items

Bringing in Cluster Fabric Servers

Using the Add Target Server feature on the Cluster Fabric Details page, you can bring in servers from other clusters and add them to the selected cluster.

Click the All Services > Compute > Multi-node GPU Server menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
On the Cluster Fabric List page, click the resource whose details you want to check. You will be taken to the Cluster Fabric Details page.
On the Details tab, click the Add button to the right of Target Server.
- The Add Target Server popup window opens.
  - In Cluster Fabric, select a cluster.
  - The GPU Nodes bound to the selected cluster are retrieved, and you can select the GPU Nodes to bring in.
  - The selected GPU Nodes are listed at the bottom by GPU Node name.
  - Click the Confirm button to complete the task.
  - Click the Cancel button to cancel the task.
- Check that the added GPU Node appears under Target Server.

Terminating Cluster Fabric

When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more information, refer to Terminating a GPU Node.

2.2 - ServiceWatch Agent Install

Users can install the ServiceWatch Agent on the GPU node of a Multi-node GPU Cluster to collect custom metrics and logs.

Reference

Collecting custom metrics/logs via ServiceWatch Agent is currently only available on Samsung Cloud Platform For Enterprise. It will be offered in other offerings in the future.

Caution

Since metric collection via ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default collected metrics, it is recommended to remove or disable unnecessary metric collection settings.

ServiceWatch Agent

The agents that must be installed on the GPU Nodes of a Multi-node GPU Cluster to collect ServiceWatch custom metrics and logs fall into two types: the Prometheus Exporter and the Open Telemetry Collector.

Category	Detailed description
Prometheus Exporter	Provides metrics of a specific application or service in a format that Prometheus can scrape. To collect OS metrics of a GPU Node, you can use Node Exporter for Linux servers or Windows Exporter for Windows servers, depending on the OS type. On a GPU Node, Node Exporter can be used for OS metric collection just as on a Virtual Server; see Virtual Server > ServiceWatch Agent for details. You can use the DCGM (NVIDIA Data Center GPU Manager) Exporter on GPU Nodes; refer to GPU Server > Install ServiceWatch Agent for how to use it.
Open Telemetry Collector	Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to various backends (e.g., Prometheus, Jaeger, Elasticsearch). It exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data. Refer to ServiceWatch > Using ServiceWatch Agent for how to use the Open Telemetry Collector provided by ServiceWatch.

Table. Description of Prometheus Exporter and Open Telemetry Collector

Notice

If you have configured Kubernetes Engine on a GPU Node, please check GPU metrics through the metrics provided by Kubernetes Engine.

If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, it may not work properly.

Reference

The ServiceWatch Agent guide for collecting GPU metrics on a GPU Node can be used the same as on a GPU Server. For more details, see GPU Server > ServiceWatch Agent.

2.3 - Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster service scope

If an IaaS HW level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installation are the user's responsibility, and technical support may be difficult in such cases. Be cautious when performing system updates or similar tasks.

IaaS HW level problems

An internal HW fault event message for the server appears in the IPMI (iLO) HW monitoring console
A GPU HW operation error is confirmed with the nvidia-smi command
HW error messages are found when inspecting the InfiniBand HCA card or InfiniBand Switch

Caution

Multi-node GPU Cluster is sensitive to software version compatibility among Ubuntu OS, NVIDIA, and InfiniBand components, so official technical support is not available after changes such as a user-initiated OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.

OS Kernel and Package Holding

Notice

If you do not want automatic updates of package versions, it is recommended to block package updates using the apt-mark command.
It is recommended to block the update of Linux kernel or IB related package versions.

To proceed with OS Kernel and Package holding, follow the procedure below.

Use the following command to check the version of the kernel and IB-related packages.

root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
ii  crash                                 7.2.8-1ubuntu1.20.04.1                  amd64        kernel debugging utility, allowing gdb like syntax
ii  dkms                                  2.8.1-5ubuntu2                          all          Dynamic Kernel Module Support Framework
ii  dmeventd                              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event daemon
ii  dmsetup                               2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
ii  iser-dkms                             5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo iser kernel modules
ii  isert-dkms                            5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo isert kernel modules
ii  kernel-mft-dkms                       4.17.2-12                               all          DKMS support for kernel-mft kernel modules
ii  kmod                                  27-1ubuntu2                             amd64        tools for managing Linux kernel modules
ii  knem                                  1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
ii  knem-dkms                             1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
ii  libaio1:amd64                         0.3.112-5                               amd64        Linux kernel AIO access library - shared library
ii  libdevmapper-event1.02.1:amd64        2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event support library
ii  libdevmapper1.02.1:amd64              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
ii  libdrm-amdgpu1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-common                         2.4.107-8ubuntu1~20.04.2                all          Userspace interface to kernel DRM services -- common files
ii  libdrm-intel1:amd64                   2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to intel-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:amd64                 2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:amd64                         2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to kernel DRM services -- runtime
ii  linux-firmware                        1.187.29                                all          Firmware for Linux kernel drivers
hi  linux-generic                         5.4.0.105.109                           amd64        Complete Generic Linux kernel and headers
ii  linux-headers-5.4.0-104               5.4.0-104.118                           all          Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii  linux-headers-5.4.0-105               5.4.0-105.119                           all          Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
hi  linux-headers-generic                 5.4.0.105.109                           amd64        Generic Linux kernel headers
ii  linux-image-5.4.0-104-generic         5.4.0-104.118                           amd64        Signed kernel image generic
ii  linux-image-5.4.0-105-generic         5.4.0-105.119                           amd64        Signed kernel image generic
hi  linux-image-generic                   5.4.0.105.109                           amd64        Generic Linux kernel image
ii  linux-libc-dev:amd64                  5.4.0-105.119                           amd64        Linux Kernel Headers for development
ii  linux-modules-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-104-generic 5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-105-generic 5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  mlnx-ofed-kernel-dkms                 5.4-OFED.5.4.3.0.3.1                    all          DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                5.4-OFED.5.4.3.0.3.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mlnx-tools                            5.2.0-0.54303                           amd64        Userspace tools to restart and tune MLNX_OFED kernel modules
ii  nvidia-kernel-common-470              470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-470              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
ii  nvidia-peer-memory                    1.2-0                                   all          nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms               1.2-0                                   all          DKMS support for nvidia-peer-memory kernel modules
ii  rsyslog                               8.2001.0-1ubuntu1.1                     amd64        reliable system and kernel logging daemon
ii  srp-dkms                              5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo srp kernel modules

Code block. Kernel, IB related package version check

Use the apt-mark command to hold the package update.
# apt-mark hold <package name>
Code block. Package update hold
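
As an illustration of the hold step, the sketch below picks hold candidates out of `dpkg -l` output. The helper name `list_hold_candidates` and its default name pattern (kernel image/headers/modules, `mlnx-`, and `nvidia-` packages) are assumptions made for this example, not part of the service; adjust the pattern to the packages you actually want to freeze. In `dpkg -l` output, a state of `ii` means installed and `hi` means installed and already held.

```shell
#!/usr/bin/env bash
# Sketch only: list_hold_candidates and its default pattern are illustrative
# assumptions, not part of the Multi-node GPU Cluster service.

# Reads `dpkg -l` output on stdin and prints the names of installed packages
# whose name matches the given extended regular expression.
list_hold_candidates() {
  local default_pattern='^(linux-(image|headers|modules)|mlnx-|nvidia-)'
  local pattern="${1:-$default_pattern}"
  # Column 1 is the dpkg state ("ii" = installed, "hi" = installed and held),
  # column 2 is the package name.
  awk -v pat="$pattern" '$1 ~ /^[hi]i$/ && $2 ~ pat { print $2 }'
}

# Typical use on a GPU Node, run as root:
#   dpkg -l | list_hold_candidates | xargs -r apt-mark hold
#   apt-mark showhold    # confirm which packages are now held
```

Reviewing the printed list before piping it to `apt-mark hold` is advisable, since the default pattern is only a starting point.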

Intel E810 Driver Update

Check the version of the Intel E810 driver and update it to the recommended version.

Notice

Intel E810 driver version recommended by the server manufacturer: 1.15.4
Driver (ice-1.15.4.tar.gz) download

The driver update method is as follows.

Move the base driver tar file to the desired directory.

Example: /home/username/ice or /usr/local/src/ice

Untar / unzip the Archiver file.
- x.x.x is the version number of the driver tar file.
  Color mode
  tar zxf ice-x.x.x.tar.gz
  tar zxf ice-x.x.x.tar.gz
  Code block. Unzip file
Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
  Color mode
  cd ice-x.x.x/src/
  cd ice-x.x.x/src/
  Code block. Directory change
Compile the driver module.
Color mode
make install
make install
Code Block. Driver Module Compile
After the update is complete, check the version.
lsmod | grep ice
modinfo ice | grep version
Code Block. Version Check
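
The version comparison can also be scripted. The sketch below reads modinfo ice output and compares the version field with the recommended version 1.15.4 stated in the notice above. The function name ice_version_check is hypothetical, and the sketch assumes that modinfo prints a line starting with "version:".

```shell
# Compare the ice module version reported by modinfo with the expected one.
# Reads `modinfo ice` output on stdin; prints "ok" or "mismatch: <found>".
# NOTE: the function name is an illustrative assumption.
ice_version_check() {
    want="${1:-1.15.4}"
    # Take the value of the "version:" field only (not "srcversion:").
    found=$(awk '$1 == "version:" { print $2; exit }')
    if [ "$found" = "$want" ]; then
        echo "ok"
    else
        echo "mismatch: ${found:-unknown}"
        return 1
    fi
}
```

A possible usage is `modinfo ice | ice_version_check`, or `modinfo ice | ice_version_check x.x.x` to compare against another version.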

NVIDIA driver check

Note

nvidia-smi topo, IB nv_peer_mem status check

To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS HW level, follow this procedure.

Check the GPU driver and HW status.

user@bm-dev-001:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     80-95   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     80-95   5
mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS
mlx5_3  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Code Block. GPU Driver and HW Status Check
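
In the matrix above, each GPU reaches exactly one HCA through PXB and the others through SYS. The following sketch extracts that GPU to HCA pairing from the matrix. The function name gpu_nearest_hca is hypothetical, and the sketch assumes plain-text output laid out as shown above (a header row starting with GPU0 and no terminal escape sequences).

```shell
# List "<GPU> <HCA>" pairs that are connected at PIX or PXB level.
# Reads `nvidia-smi topo -m` output on stdin.
# NOTE: the function name is an illustrative assumption.
gpu_nearest_hca() {
    awk '
        !seen && $1 == "GPU0" {            # header row: remember the column names
            for (i = 1; i <= NF; i++) name[i + 1] = $i
            seen = 1
            next
        }
        seen && $1 ~ /^GPU[0-9]+$/ {       # one data row per GPU
            for (i = 2; i <= NF; i++)
                if (name[i] ~ /^mlx5_/ && ($i == "PIX" || $i == "PXB"))
                    print $1, name[i]
        }
    '
}
```

A possible usage is `nvidia-smi topo -m | gpu_nearest_hca`. For the output above this would pair GPU0 and GPU1 with mlx5_1, GPU2 and GPU3 with mlx5_0, GPU4 and GPU5 with mlx5_3, and GPU6 and GPU7 with mlx5_2.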

Check the NVSwitch HW status.

user@bm-dev-001:~$ nvidia-smi nvlink --status
0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s

Code block. NVSwitch HW status check
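
With 8 GPUs and 12 links per GPU, the output above contains 96 links that report a speed. A quick way to confirm this without reading every line is to count them. The sketch below is only an illustration: the function name count_active_nvlinks is hypothetical, and it assumes that an active link is printed as "Link N: <speed> GB/s", as in the output above.

```shell
# Count the links that report a speed in `nvidia-smi nvlink --status` output.
# Reads the command output on stdin and prints the number of matching links.
# NOTE: the function name is an illustrative assumption.
count_active_nvlinks() {
    # gsub() returns the number of matches on each line; "&" keeps the text as is.
    awk '{ n += gsub(/Link [0-9]+: [0-9.]+ GB\/s/, "&") } END { print n + 0 }'
}
```

A possible usage is `nvidia-smi nvlink --status | count_active_nvlinks`. A result lower than 96 on this server type suggests that at least one link does not report a speed and should be looked at in the full output.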

Check the InfiniBand (IB) HCA card HW status and link state.

user@bm-dev-001:~$ ibdev2netdev -v
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$


root@bm-dev-001:~# ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff5060ac
        System image GUID: 0x88e9a4ffff5060ac
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 8
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff5060ac
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504080
        System image GUID: 0x88e9a4ffff504080
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504080
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff505038
        System image GUID: 0x88e9a4ffff505038
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff505038
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504094
        System image GUID: 0x88e9a4ffff504094
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 7
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504094
                Link layer: InfiniBand

Code block. InfiniBand(IB) HCA card HW status and Link check
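
In the ibstat output above, every port shows State: Active, Physical state: LinkUp, and Rate: 200. The following sketch lists only the HCAs that deviate from those values. The function name ib_ports_not_ready is hypothetical, and the expected rate of 200 is taken from the sample output rather than from a documented requirement.

```shell
# Print "<CA> <state> <physical state> <rate>" for every port that is not
# Active/LinkUp at the expected rate. Reads `ibstat` output on stdin.
# NOTE: the function name and the default rate (200) are illustrative assumptions.
ib_ports_not_ready() {
    want_rate="${1:-200}"
    awk -v rate="$want_rate" '
        $1 == "CA" && $2 != "type:" {          # line such as: CA followed by the quoted name
            ca = $2
            gsub(/[^A-Za-z0-9_]/, "", ca)      # strip the surrounding quotes
        }
        $1 == "State:"                      { state = $2 }
        $1 == "Physical" && $2 == "state:"  { phys = $3 }
        $1 == "Rate:" {
            if (state != "Active" || phys != "LinkUp" || $2 != rate)
                print ca, state, phys, $2
        }
    '
}
```

A possible usage is `ibstat | ib_ports_not_ready`. No output means that all ports match the expected state and rate.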

IB bandwidth communication check

To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS HW level, follow these steps.

Check the name of the IB HCA interface.

user@bm-dev-001:~$ ibdev2netdev
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
mlx5_3 port 1 ==> ibs19 (Down)

Code block. Check the name of IB HCA interface

Check the HCA interface that can communicate with IB Switch#1.

mlx5_0 port 1 ==> ibs18 (Down)
mlx5_2 port 1 ==> ibs20 (Down)

Code Block. HCA Interface Check

Check the HCA interface that can communicate with IB Switch#2.

mlx5_1 port 1 ==> ibs17 (Down)
mlx5_3 port 1 ==> ibs19 (Down)

Code Block. HCA Interface Check

Run the SERVER side command to check the communication status.

The CLIENT side command is entered second, on the other server, so that the two sides can communicate.

user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x002e PSN 0xa86622
 remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40                   0.317238
---------------------------------------------------------------------------------------

Code Block. Communication Status Check

Run the CLIENT side command to check the communication status.

The SERVER side command must be entered first so that the two sides can communicate.

root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0a QPN 0x002a PSN 0x98a48e
 remote address: LID 0x07 QPN 0x002c PSN 0xe68304
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37                  0.304102
---------------------------------------------------------------------------------------

Code Block. Communication Status Check
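
The BW average column is reported in MB/sec, while ibstat reports the link rate in Gb/s, so a unit conversion helps when comparing the two. The sketch below extracts the BW average value from the result row and converts it. It assumes that 1 MB here means 2^20 bytes, which is an assumption about the tool's reporting unit that you should verify for your perftest version. The function name bw_average_gbps is hypothetical.

```shell
# Extract "BW average[MB/sec]" from ib_send_bw output and convert it to Gb/s.
# Reads the command output on stdin.
# NOTE: the function name is illustrative, and 1 MB = 1048576 bytes is an assumption.
bw_average_gbps() {
    awk '
        # The result row is the only row with five columns that starts with a number.
        $1 ~ /^[0-9]+$/ && NF == 5 { avg = $4 }
        END {
            if (avg == "") exit 1
            printf "%.1f\n", avg * 1048576 * 8 / 1000000000
        }
    '
}
```

Under this assumption, the client side average of 19006.37 MB/sec above corresponds to about 159.4 Gb/s. A possible usage is `ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP> | bw_average_gbps`.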

Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.

user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset

Code block. IB service-related kernel module check(1)

user@bm-dev-001:~$ service nv_peer_mem status
  nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
     Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
     Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)

Code block. IB service-related kernel module check(2)

user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi   110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

Code block. IB service-related kernel module check(3)
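
Instead of reading the lsmod listings by eye, the presence of specific modules can be checked in one pass. The sketch below prints the modules that are not loaded. The function name missing_modules is hypothetical, and the module list in the usage line is taken from the sample output above, not from an official requirement list.

```shell
# Print each module name given as an argument that is NOT in the lsmod output.
# Reads `lsmod` output on stdin.
# NOTE: the function name is an illustrative assumption.
missing_modules() {
    loaded=$(awk '{ print $1 }')       # first column holds the module names
    for m in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        printf '%s\n' "$loaded" | grep -Fqx "$m" || echo "$m"
    done
}
```

A possible usage is `lsmod | missing_modules nv_peer_mem ib_core ib_ipoib ib_umad ib_uverbs mlx5_ib mlx5_core`. No output means that all listed modules are loaded.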

Storage Physical Disk Resources and Multi-Path Check

Inspect the IaaS HW level by checking the storage physical disk resources and the Multi-Path configuration.

root@bm-dev-002:/tmp# fdisk -l

Code block. Storage physical disk resource check result

root@bm-dev-002:/tmp# multipath -ll

Code Block. Multi-Path Verification Result

Checking the Service Network after a new Multi-node GPU Cluster deployment

Use the following command to check whether the MII Status of the bonding interface and of each Slave Interface is up.

command

root@mngc-001:~# cat /proc/net/bonding/bond-srv
Ethernet Channel Bonding Driver: v5.15.0-25-generic

Code Block. Service Network Check Command

confirmation result

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0

Code Block. Service Network Check Command Result
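
The MII Status check can be reduced to a list of interfaces that are not up. The sketch below reads the bonding status file and prints the name of every link whose MII Status differs from up. The function name bond_links_down is hypothetical, and the label "bond" for the first MII Status line (which belongs to the bonding interface itself) is an illustrative choice.

```shell
# Print every interface in a bonding status file whose MII Status is not "up".
# Reads the content of /proc/net/bonding/<bond name> on stdin.
# NOTE: the function name and the "bond" label are illustrative assumptions.
bond_links_down() {
    awk '
        /^Slave Interface:/ { iface = $3 }
        /^MII Status:/ {
            # The first MII Status line appears before any slave section.
            name = (iface == "" ? "bond" : iface)
            if ($3 != "up") print name
        }
    '
}
```

A possible usage is `bond_links_down < /proc/net/bonding/bond-srv`. No output means that the bonding interface and all slave interfaces report MII Status: up.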

Reference

If any Slave Interface is in a down state, report the issue through Contact Us in the Support Center so that action can be taken.

Checking the Time Server and time synchronization after a new Multi-node GPU Cluster deployment

The OS image has the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check that one line is marked with ^* in the MS column; ^* marks the source that chrony is currently synchronized to.

command
root@mngc-001:~# chronyc sources -V
Code Block. Time synchronization check command

confirmation result

MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms

Code block. Time synchronization check result
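
The check for a ^* line can be scripted so that it returns a non-zero exit status when no source is selected. The sketch below is an illustration only, and the function name chrony_synced_source is hypothetical.

```shell
# Print the address of the source marked "^*" and succeed,
# or print nothing and fail when no such line exists.
# Reads `chronyc sources` output on stdin.
# NOTE: the function name is an illustrative assumption.
chrony_synced_source() {
    awk '$1 == "^*" { print $2; found = 1 } END { exit !found }'
}
```

A possible usage is `chronyc sources | chrony_synced_source || echo "not synchronized"`. For the result above, the function would print 198.19.1.54.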

GPU MIG/ECC Setting Initialization Check Guide

When you apply for the Multi-node GPU Cluster product, the GPU MIG/ECC settings are initialized. To make sure the settings take effect, restart the server once before first use, and then follow this inspection guide to verify that the settings have been applied.

Reference

MIG: Multi-Instance GPU
ECC: Error Correction Code

MIG Setup Initialization

Refer to the following for how to check and initialize MIG settings.

Use the following command to check if the status value of MIG M is Disabled.

command
root@bm-dev-001:~# nvidia-smi
Code Block. MIG M. Initialize Settings

confirmation result

+-----------------------------------------------------------------------------------------+
|  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
|----------------------------------+-----------------------------+------------------------|
|  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
|  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
|                                  |                             |                MIG M.  |
|==================================+=============================+========================|
|    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
|  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
|                                  |                             |              Disabled  |
+----------------------------------+-----------------------------+------------------------+
|    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
|  N/A  30C     P0    58W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
|                                  |                             |              Disabled  |
+-----------------------------------------------------------------------------------------+

Code Block. MIG M. Initialization Setting Check Result

If MIG M.’s status value is not Disabled, use the following command to initialize MIG.
root@bm-dev-001:~# nvidia-smi -mig 0
root@bm-dev-001:~# nvidia-smi --gpu-reset
Code Block. MIG M. Status Value Initialization
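
On a server with 8 GPUs, the MIG mode can also be read in a machine-friendly form with the nvidia-smi query interface instead of scanning the table. The sketch below lists the indexes of GPUs whose current MIG mode is not Disabled. The function name mig_enabled_gpus is hypothetical, and the sketch assumes CSV output such as "0, Disabled" from the mig.mode.current query field.

```shell
# Print the index of every GPU whose current MIG mode is not "Disabled".
# Reads the output of
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
# on stdin.
# NOTE: the function name is an illustrative assumption.
mig_enabled_gpus() {
    awk -F', *' '$2 != "Disabled" { print $1 }'
}
```

A possible usage is `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | mig_enabled_gpus`. No output means that MIG is Disabled on all GPUs; otherwise the listed GPUs are candidates for the initialization commands above.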

ECC Setting Initialization

Refer to the following for how to check and initialize the ECC settings.

Use the following command to check if the status value of Volatile Uncorr. ECC is Off.

command
root@bm-dev-001:~# nvidia-smi
Code Block. ECC Setting Command

confirmation result

+-----------------------------------------------------------------------------------------+
|  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
|----------------------------------+-----------------------------+------------------------|
|  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
|  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
|                                  |                             |                MIG M.  |
|==================================+=============================+========================|
|    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
|  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
|                                  |                             |              Disabled  |
+----------------------------------+-----------------------------+------------------------+
|    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
|  N/A  30C     P0    61W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
|                                  |                             |              Disabled  |
+-----------------------------------------------------------------------------------------+

Code Block. ECC Setting Check Result

If the Volatile Uncorr. ECC status value is On*, reboot the server.
If the Volatile Uncorr. ECC status value is neither On* nor Off, use the following command to initialize ECC. After initialization, reboot and check that the status value is Off.

root@bm-dev-001:~# nvidia-smi --ecc-config=0

Code Block. ECC Status Value Initialization
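
The ECC rules above amount to a three-way decision on the value shown in the Volatile Uncorr. ECC column. The sketch below only encodes that decision so it can be reused in a checklist script; the function name `ecc_action` and the action labels it prints are hypothetical, and the status value is assumed to be passed in exactly as displayed by `nvidia-smi`.

```shell
#!/bin/sh
# Hypothetical helper: map the displayed Volatile Uncorr. ECC status value
# to the action described in this guide.
#   Off        -> nothing to do
#   On*        -> reboot the server
#   other      -> run "nvidia-smi --ecc-config=0", reboot, then re-check
ecc_action() {
  case "$1" in
    Off)   echo "none" ;;
    "On*") echo "reboot" ;;   # quoted so the asterisk is matched literally
    *)     echo "disable-ecc-then-reboot" ;;
  esac
}

# Example call with a status value copied from the nvidia-smi output.
ecc_action "Off"
```

The function performs no changes itself; it only reports which step of the procedure applies, so the actual commands remain under the operator's control.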
