Multi-node GPU Cluster
1 - Overview
Service Overview
Multi-node GPU Cluster is a service that provides physical GPU servers without virtualization for large-scale high-performance AI computation. You can use two or more Bare Metal Servers equipped with GPUs to cluster multiple GPUs, and conveniently operate GPU servers in conjunction with Samsung Cloud Platform’s high‑performance storage and networking services.
Provided Features
The Multi-node GPU Cluster provides the following features.
- Auto Provisioning and Management: Through the web-based Console, you can easily provision servers of the standard GPU Bare Metal model equipped with 8 GPUs and manage resources and costs.
- Network Connection: You can cluster multiple GPUs on two or more Bare Metal Servers via high‑speed interconnects, and by configuring a GPU Direct RDMA (Remote Direct Memory Access) environment, you can directly process data I/O between GPU memories, enabling high‑speed AI/Machine Learning computation.
- Storage Connection: Provides various additional attached storage besides the OS disk. High-performance SSD NAS File Storage, Block Storage, and Object Storage directly integrated with a high-speed network can also be used together.
- Network Configuration Management: The server’s subnet/IP can be easily changed from the values set at initial creation. NAT IP can be enabled or disabled as needed.
- Monitoring: You can view monitoring information for computing resources such as CPU, GPU, Memory, and Disk through Cloud Monitoring. To use the Cloud Monitoring service with a Multi-node GPU Cluster, you need to install the Agent; installing it is recommended for stable service. For more details, refer to Multi-node GPU Cluster Monitoring Metrics.
- Terraform Provision: Provides an IaC environment via Terraform.
Component
Multi-node GPU Cluster provides GPUs as a Bare Metal Server type with standard images and server types. NVSwitch and NVLink are provided.
Specifications by GPU Type
A GPU (Graphics Processing Unit) is specialized for parallel operations that process large amounts of data quickly, enabling large-scale parallel computation in fields such as artificial intelligence (AI) and data analysis.
The following are the specifications of GPU types offered by the Multi-node GPU Cluster service.
| Category | H100 Type | B300 Type |
|---|---|---|
| GPU Architecture | NVIDIA Hopper | NVIDIA Blackwell Ultra |
| GPU Memory | 80 GiB | 268 GiB |
| GPU Transistors | 80 billion 4N TSMC | 208 billion 4NP TSMC |
| FP16 Tensor Core (Dense) | 989 TFLOPs | 2.25 PFLOPs |
| FP8 Tensor Core (Dense) | 1979 TFLOPs | 4.5 PFLOPs |
| FP4 Tensor Core (Dense) | Not supported | 13.5 PFLOPs |
| GPU Memory Bandwidth | 3,352 GB/s HBM3 | 8 TB/s HBM3e |
| NVLink performance | NVLink 4 | NVLink 5 |
| NVLink Signaling Rate | 25 GB/s (x18) | 50 GB/s (x18) |
| NVSwitch GPU-to-GPU bandwidth | 900 GB/s | 1.8 TB/s |
| Total NVSwitch aggregate bandwidth | 7.2 TB/s | 14.4 TB/s |
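The bandwidth rows in the table are related by simple arithmetic. The sketch below reproduces them under three assumptions that are an interpretation matching the table values, not a statement from the service itself: the per-link signaling rate is a per-direction figure, the GPU-to-GPU bandwidth counts both directions across all 18 links, and the aggregate figure is the per-GPU bandwidth multiplied by the 8 GPUs in a node.

```python
# Worked arithmetic for the NVLink/NVSwitch rows of the table above.
# Assumption: the per-link rate is per direction, and the GPU-to-GPU figure
# counts both directions over all links.

GPUS_PER_NODE = 8  # the standard GPU Bare Metal model has 8 GPUs


def gpu_to_gpu_gb_s(per_link_gb_s: int, links: int) -> int:
    """Bidirectional NVLink bandwidth of one GPU in GB/s."""
    return per_link_gb_s * links * 2


def aggregate_tb_s(per_gpu_gb_s: int, gpus: int = GPUS_PER_NODE) -> float:
    """Aggregate NVSwitch bandwidth of one node in TB/s (1 TB = 1000 GB)."""
    return per_gpu_gb_s * gpus / 1000


h100 = gpu_to_gpu_gb_s(25, 18)  # H100 Type: 25 GB/s (x18)
b300 = gpu_to_gpu_gb_s(50, 18)  # B300 Type: 50 GB/s (x18)
print(h100, aggregate_tb_s(h100))
print(b300, aggregate_tb_s(b300))
```

Under these assumptions the results line up with the 900 GB/s and 1.8 TB/s per-GPU rows and the 7.2 TB/s and 14.4 TB/s aggregate rows.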
OS and GPU driver version
The operating systems (OS) supported by the Multi-node GPU Cluster are as follows.
| OS | OS version | GPU driver version |
|---|---|---|
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |
| Ubuntu | 24.04 | 580.105.08 |
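As a small illustration, the OS and driver combinations in the table can be kept as a lookup and used to check a node. This is a minimal sketch: the dictionary and the function name `is_supported` are hypothetical helpers, and the values are copied from the table above.

```python
# Supported OS/driver combinations, copied from the table above.
SUPPORTED_DRIVERS = {
    ("Ubuntu", "22.04"): {"535.86.10", "535.183.06"},
    ("Ubuntu", "24.04"): {"580.105.08"},
}


def is_supported(os_name: str, os_version: str, driver_version: str) -> bool:
    """Return True if the driver version is listed for the given OS release."""
    listed = SUPPORTED_DRIVERS.get((os_name, os_version), set())
    return driver_version.strip() in listed


print(is_supported("Ubuntu", "22.04", "535.183.06"))
print(is_supported("Ubuntu", "24.04", "535.86.10"))
```

On a node, the installed driver version can typically be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function.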
Server type
The format of server types provided by the Multi-node GPU Cluster is as follows.
- Example: when the server type is g2c96h8_metal
| Category | Example | Detailed description |
|---|---|---|
| Server generation | g2 | Provided server generation |
| CPU | c96 | Number of cores |
| GPU | h8 | GPU type and quantity |
Preceding Service
This is a list of services that must be pre-configured before creating the service. Please refer to the guide provided for each service and prepare in advance.
| Service Category | Service | Detailed description |
|---|---|---|
| Networking | VPC | A service that provides an isolated virtual network in a cloud environment |
1.1 - Server type
Multi-node GPU Cluster server type
Multi-node GPU Cluster is categorized based on the GPU Type it provides, and the GPU used in a Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Select the server type based on the specifications of the application you want to run on a multi-node GPU cluster.
The server types supported by the Multi-node GPU Cluster are as follows.
- Example: when the server type is g2c96h8_metal.
| Category | Example | Detailed description |
|---|---|---|
| Server generation | g2 | Provided server generation. g2: g means GPU server specification, and 2 means generation |
| CPU | c96 | Number of cores. c96: Allocated cores are physical cores |
| GPU | h8 | GPU type and quantity. h8: h means GPU type, and 8 means GPU quantity |
Table. Multi-node GPU Cluster server type format - g2
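The naming format described above can be decoded mechanically. The following is a minimal sketch, assuming every server type follows the same pattern as the two names that appear in this guide (g2c96h8_metal and g3c128b8_metal); the regular expression and the `parse_server_type` helper are hypothetical and not part of any Samsung Cloud Platform tool.

```python
import re
from typing import NamedTuple

# Hypothetical pattern inferred from g2c96h8_metal and g3c128b8_metal;
# other server types may not follow it.
_SERVER_TYPE = re.compile(r"g(\d+)c(\d+)([a-z])(\d+)_metal")


class ServerType(NamedTuple):
    generation: int  # the digit after "g"
    cores: int       # the number after "c"
    gpu_code: str    # single letter for the GPU type, e.g. "h" or "b"
    gpu_count: int   # the number after the GPU letter


def parse_server_type(name: str) -> ServerType:
    """Split a server type name into generation, cores, GPU code, and GPU count."""
    match = _SERVER_TYPE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized server type: {name!r}")
    gen, cores, gpu_code, gpu_count = match.groups()
    return ServerType(int(gen), int(cores), gpu_code, int(gpu_count))


print(parse_server_type("g2c96h8_metal"))
```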
g2 server type
The g2 server type is a GPU Bare Metal Server that uses NVIDIA H100 SXM GPUs, suitable for large-scale high-performance AI computation.
- 8 NVIDIA Hopper Architecture-based H100 GPUs provided
- Provides 1,979 TFLOPS FP8 and 989 TFLOPS FP16 Tensor Core performance per GPU.
- Supports up to 96 vCPUs and 2,048 GB of memory
- Supports up to 1,600 Gb/s NVIDIA InfiniBand RDMA network.
- Service network up to 100 Gbps
- 900 GB/s GPU P2P communication via NVSwitch within a node
| Server type | GPU | GPU Memory | CPU(Core) | Memory | Disk | GPU P2P |
|---|---|---|---|---|---|---|
| g2c96h8_metal | H100 | 640 GiB | 96 vCore | 2 TB | SSD (OS) 960 GB * 2, NVMeSSD 3.84 TB * 4 | 900 GB/s NVSwitch |
g3 server type
The g3 server type is a GPU Bare Metal Server that uses NVIDIA B300 SXM GPUs, suitable not only for large-scale high-performance AI computation but also for LLM inference and AI deployment for generative AI.
- 8 NVIDIA Blackwell Ultra Architecture-based B300 GPUs provided
- Provides 13.5 PFLOPS FP4 Tensor Core and 4.5 PFLOPS FP8 Tensor Core performance per GPU.
- Supports up to 128 vCPUs and 4,096 GB of memory
- Supports up to 6,400 Gb/s NVIDIA InfiniBand RDMA network
- Service network up to 100 Gbps
- 1.8 TB/s GPU P2P communication via NVSwitch within a node
| Server type | GPU | GPU Memory | CPU(Core) | Memory | Disk | GPU P2P |
|---|---|---|---|---|---|---|
| g3c128b8_metal | B300 | 2.1 TiB | 128 vCore | 4 TB | SSD (OS) 960 GB * 2, NVMeSSD 3.84 TB * 4 | 1.8 TB/s NVSwitch |
1.2 - Monitoring Metrics
According to Samsung Cloud Platform’s policy, the Cloud Monitoring service is scheduled to be discontinued in September 2026.
Accordingly, after the September 2026 release, resource monitoring of Samsung Cloud Platform via Cloud Monitoring will no longer be possible.
As an alternative, you can continue resource monitoring with ServiceWatch, which was released in October 2025.
ServiceWatch provides more modern and powerful features, replacing Cloud Monitoring to deliver a seamless monitoring environment.
If you are collecting metrics and logs through the Cloud Monitoring Agent, you need to switch to the ServiceWatch Agent.
For more details about ServiceWatch, please refer to ServiceWatch Overview.
For more details about the ServiceWatch Agent, please refer to ServiceWatch Agent.
Multi-node GPU Cluster Monitoring Metrics
The table below shows the monitoring metrics of a Multi-node GPU Cluster that can be viewed through Cloud Monitoring.
Multi-node GPU Cluster [Cluster]
| Performance items | Detailed description | unit |
|---|---|---|
| Memory Total [Basic] | bytes of usable memory | bytes |
| Memory Used [Basic] | Current memory usage in bytes | bytes |
| Memory Swap In [Basic] | Bytes of memory swapped in | bytes |
| Memory Swap Out [Basic] | Bytes of memory swapped out | bytes |
| Memory Free [Basic] | bytes of unused memory | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | write bytes | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | Average system CPU usage over 1 minute | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Incoming packet drop | cnt |
| Network In Packets [Basic] | Number of received packets | cnt |
| Network Out Bytes [Basic] | sent bytes | bytes |
| Network Out Dropped [Basic] | Transmit packet drop | cnt |
| Network Out Packets [Basic] | Number of transmitted packets | cnt |
| Performance items | Detailed description | unit |
|---|---|---|
| Cluster GPU Count | Sum of GPU counts in the cluster | cnt |
| Cluster GPU Count In Use | Number of GPUs being used by jobs within the cluster | cnt |
| Cluster GPU Usage | Average GPU utilization within the cluster | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization within the cluster | % |
Multi-node GPU Cluster [Node]
| Performance items | Detailed description | unit |
|---|---|---|
| Memory Total [Basic] | bytes of usable memory | bytes |
| Memory Used [Basic] | Current memory usage in bytes | bytes |
| Memory Swap In [Basic] | Bytes of memory swapped in | bytes |
| Memory Swap Out [Basic] | Bytes of memory swapped out | bytes |
| Memory Free [Basic] | bytes of unused memory | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | write bytes | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | Average system CPU usage over 1 minute | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Incoming packet drop | cnt |
| Network In Packets [Basic] | Number of received packets | cnt |
| Network Out Bytes [Basic] | sent bytes | bytes |
| Network Out Dropped [Basic] | Transmit packet drop | cnt |
| Network Out Packets [Basic] | Number of transmitted packets | cnt |
| Performance items | Detailed description | unit |
|---|---|---|
| GPU Count | Number of GPUs | cnt |
| GPU Temperature | GPU temperature | ℃ |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU utilization (%) | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current GPU power usage | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs in use by jobs on the node | cnt |
| Execution Status for nvidia-smi | Result of running the nvidia-smi command | status |
| Core Usage [IO Wait] | Ratio of CPU time spent in wait state (disk wait) | % |
| Core Usage [System] | Proportion of CPU time spent in kernel space | % |
| Core Usage [User] | Proportion of CPU time spent in user space | % |
| CPU Cores | The number of CPU cores on the host. Unnormalized percentages have a maximum of 100% × the number of cores. Normalized percentages already take this value into account and have a maximum of 100%. | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (400% when all 4 cores are used at 100%) | % |
| CPU Usage [Idle] | Proportion of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Proportion of CPU time spent in a wait state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (400% when all 4 cores are used at 100%) | % |
| CPU Usage [User] | Percentage of CPU time used in user space (400% when all 4 cores are used at 100%) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (value normalized by the number of cores; 100% when all 4 cores are used at 100%) | % |
| CPU Usage/Core [Idle] | Proportion of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Proportion of CPU time spent in a wait state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (value normalized by the number of cores; 100% when all 4 cores are used at 100%) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space (value normalized by the number of cores; 100% when all 4 cores are used at 100%) | % |
| Disk CPU Usage [IO Request] | Proportion of CPU time during which I/O requests were issued to the device (device bandwidth utilization). The device is saturated when this value approaches 100%. | % |
| Disk Queue Size [Avg] | The average queue length of requests executed for the device. | num |
| Disk Read Bytes | The number of bytes read per second from the device. | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta for individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of the system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of the system.diskio.read.bytes value for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes successfully read. On Linux, assuming a sector size of 512, it is the number of sectors read multiplied by 512. | bytes |
| Disk Read Requests | Number of read requests to the disk device per second | cnt |
| Disk Read Requests [Delta Avg] | Average of the system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Max] | Maximum system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta of individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of successful reads | cnt |
| Disk Request Size [Avg] | Average size of requests executed on the device (unit: sectors). | num |
| Disk Service Time [Avg] | Average service time (ms) of input requests executed on the device. | ms |
| Disk Wait Time [Avg] | Average time for requests issued to the device to be served. | ms |
| Disk Wait Time [Read] | Average disk read wait time | ms |
| Disk Wait Time [Write] | Average disk write wait time | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Max] | Maximum system.diskio.write.bytes_delta of individual disks | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta for individual disks | bytes |
| Disk Write Bytes [Delta Sum] | Sum of the system.diskio.write.bytes_delta of individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of the system.diskio.write.bytes value for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes successfully written. On Linux, assuming a sector size of 512, it is the number of sectors written multiplied by 512. | bytes |
| Disk Write Requests | Number of write requests to the disk device per second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of the system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of successful writes | cnt |
| Disk Writes Bytes | The number of bytes written per second to the device. | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | Total number of file nodes in the file system. | cnt |
| Filesystem Nodes [Free] | Total number of available file nodes in the file system. | cnt |
| Filesystem Size [Available] | Disk space (bytes) available to unprivileged users. | bytes |
| Filesystem Size [Free] | Available disk space (bytes) | bytes |
| Filesystem Size [Total] | Total disk space (bytes) | bytes |
| Filesystem Usage | Used disk space percentage | % |
| Filesystem Usage [Avg] | Average of individual filesystem.used.pct | % |
| Filesystem Usage [Inode] | inode usage | % |
| Filesystem Usage [Max] | Maximum among individual filesystem.used.pct | % |
| Filesystem Usage [Min] | minimum of individual filesystem.used.pct | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used disk space (bytes) | bytes |
| Filesystem Used [Inode] | inode usage | bytes |
| Memory Free | Total amount of available memory (bytes). Memory used by system cache and buffers is not included (see system.memory.actual.free). | bytes |
| Memory Free [Actual] | Actual usable memory (bytes). The calculation method varies by OS: on Linux, it is MemAvailable from /proc/meminfo, or, if /proc/meminfo is not available, it is calculated from free memory plus cache and buffers. On OSX, it is the sum of free memory and inactive memory. On Windows, it is equal to system.memory.free. | bytes |
| Memory Free [Swap] | Available swap memory. | bytes |
| Memory Total | total memory | bytes |
| Memory Total [Swap] | Total swap memory. | bytes |
| Memory Usage | Percentage of used memory | % |
| Memory Usage [Actual] | Percentage of memory actually used | % |
| Memory Usage [Cache Swap] | Cached swap usage rate | % |
| Memory Usage [Swap] | Percentage of used swap memory | % |
| Memory Used | used memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes). The value obtained by subtracting available memory from total memory. Available memory is calculated differently for each OS (see system.actual.free). | bytes |
| Memory Used [Swap] | Used swap memory. | bytes |
| Collisions | Network collision | cnt |
| Network In Bytes | Number of received bytes | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Min] | Minimum system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta for individual networks | bytes |
| Network In Bytes [Delta] | Delta of received byte count | bytes |
| Network In Dropped | Number of dropped packets among incoming packets | cnt |
| Network In Errors | Number of errors during reception | cnt |
| Network In Packets | Number of received packets | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta for individual networks | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta for individual networks | cnt |
| Network In Packets [Delta] | Delta of received packet count | cnt |
| Network Out Bytes | Number of transmitted bytes | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Max] | Maximum system.network.out.bytes_delta of individual networks | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta for individual networks | bytes |
| Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta for individual networks | bytes |
| Network Out Bytes [Delta] | Delta of transmitted byte count | bytes |
| Network Out Dropped | Number of dropped packets among outgoing packets. This value is not reported by the operating system, so it is always 0 on Darwin and BSD. | cnt |
| Network Out Errors | Number of errors during transmission | cnt |
| Network Out Packets | Number of transmitted packets | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta for individual networks | cnt |
| Network Out Packets [Delta] | Delta of transmitted packet count | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Available port usage rate | % |
| SYN Sent Sockets | Number of sockets in SYN_SENT state (when connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update. This value is similar to the %CPU value shown for the process by the top command on Unix systems. | % |
| Process CPU Usage/Core | The percentage of CPU time used by the process since the last event. Normalized by the number of cores, with values ranging from 0 to 100%. | % |
| Process Memory Usage | Proportion of main memory (RAM) occupied by a process | % |
| Process Memory Used | Resident Set size. The amount of memory a process occupies in RAM. In Windows, the current working set size. | bytes |
| Process PID | process pid | PID |
| Process PPID | parent process PID | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | stopped processes count | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state is unknown or could not be retrieved | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | process usage | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Total number of threads running in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | context switch count (per second) | cnt |
| Load/Core [1 min] | The load over the last 1 minute divided by the number of cores | cnt |
| Load/Core [15 min] | The load over the last 15 minutes divided by the number of cores | cnt |
| Load/Core [5 min] | The load over the last 5 minutes divided by the number of cores | cnt |
| Multipaths [Active] | External storage connection path status = active count | cnt |
| Multipaths [Failed] | External storage connection path status = failed count | cnt |
| Multipaths [Faulty] | External storage connection path status = faulty count | cnt |
| NTP Offset | measured offset of the last sample (the time difference between the NTP server and the local environment) | num |
| Run Queue Length | Execution queue length | num |
| Uptime | OS uptime (milliseconds) | ms |
| Context Switches | CPU context switch count (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from a Windows logical disk in 1 second | cnt |
| Disk Read Time [Avg] | Average data read time (seconds) | sec |
| Disk Transfer Time [Avg] | Disk average wait time | sec |
| Disk Usage | Disk usage | % |
| Disk Write Bytes [Sec] | Number of bytes written in one second on a Windows logical disk | cnt |
| Disk Write Time [Avg] | Average data write time (seconds) | sec |
| Pagingfile Usage | Paging file usage | % |
| Pool Used [Non Paged] | Nonpaged Pool usage in kernel memory | bytes |
| Pool Used [Paged] | Paged Pool usage in kernel memory | bytes |
| Process [Running] | Number of currently running processes | cnt |
| Threads [Running] | Number of currently running threads | cnt |
| Threads [Waiting] | Number of threads waiting for processor time | cnt |
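In the Node metrics table above, the CPU Usage rows and the CPU Usage/Core rows differ only in normalization: the former can reach 100% times the number of cores, while the latter are divided by the core count and stay within 100%. A minimal sketch of that relationship follows; the function name `normalize_cpu_pct` is a hypothetical helper, not part of the monitoring Agent.

```python
def normalize_cpu_pct(unnormalized_pct: float, cores: int) -> float:
    """Convert an unnormalized CPU percentage to a per-core (normalized) value."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return unnormalized_pct / cores


# The 4-core example used in the table: all cores at 100% is 400% unnormalized.
print(normalize_cpu_pct(400.0, 4))
```

The same conversion applies to a larger host such as a 96-core node, where the unnormalized maximum is 9,600%.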
2 - How-to guides
Users can create the service by entering the required information for the Multi-node GPU Cluster service and selecting detailed options through the Samsung Cloud Platform Console.
Multi-node GPU Cluster Getting Started
You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.
This service consists of a GPU Node and a Cluster Fabric service.
Create GPU Node
To create a Multi-node GPU Cluster, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
- On the Service Home page, click the Create GPU Node button. You will be taken to the Create GPU Node page.
- On the Create GPU Node page, enter the information required to create the service and select detailed options.
- Select the required information in the Image and Version Selection area.
| Category | Required | Detailed description |
|---|---|---|
| Image | Required | Select the type of image provided: Ubuntu |
| Image version | Required | Select the version of the chosen image from the list of versions provided for the server image |
Table. GPU Node image and version selection options
- In the Service Information Input area, enter or select the required information.
| Category | Required | Detailed description |
|---|---|---|
| Number of servers | Required | Number of GPU Node servers to create simultaneously. Only numeric input is allowed, and the minimum number of servers to create is 2. Two or more servers can be created only during the initial setup; subsequent expansions are done one server at a time. |
| Service Type > Server Type | Required | GPU Node server type. Select the desired CPU, Memory, GPU, and Disk specifications. For detailed information about the server types provided by GPU Node, refer to Multi-node GPU Cluster Server Type. |
| Service Type > Planned Compute | Required | Status of resources with Planned Compute configured. In Use: number of resources with Planned Compute that are currently in use. Configured: number of resources with Planned Compute configured. Coverage Preview: amount applied per resource by Planned Compute. Apply for Planned Compute Service: navigate to the Planned Compute service application page. For more details, refer to Apply for Planned Compute. |
Table. GPU Node Service Information Input Items
- In the Required Information Input area, enter or select the necessary information.
| Category | Required | Detailed description |
|---|---|---|
| Administrator account | Required | Set the administrator account and password to be used when connecting to the server. For Ubuntu, the account is fixed to root. |
| Server name Prefix | Required | Enter a prefix to distinguish each GPU Node created when the selected number of servers is 2 or more. Names are automatically generated in the form of the user input value (prefix) + ‘-###’. The prefix must start with a lowercase English letter and use lowercase letters, numbers, and the special character (-), within 3 to 11 characters. It must not end with the special character (-). |
| Network Settings | Required | Set the network where the GPU Node will be installed. VPC name: select a pre‑created VPC. General Subnet name: select a pre‑created general Subnet. IP: can be auto‑generated or manually entered; if manual input is chosen, the user enters the IP directly. NAT: usable only when there is a single server and the VPC is attached to an Internet Gateway. Check to use, and you can select a NAT IP (because a cluster can initially be created only with two or more servers, set this on the resource details page after creation). NAT IP: select a NAT IP. If no NAT IP is available, click the Create new button to generate a Public IP, then click the Refresh button to view and select the created Public IP. Creating a Public IP incurs charges according to the Public IP pricing. |
Table. Required input fields for GPU Node
- In the Cluster Selection area, create or select a Cluster Fabric.
| Category | Required | Detailed description |
|---|---|---|
| Cluster Fabric | Required | Configure a group of GPU Node servers that can use GPU Direct RDMA together. Optimal GPU performance and speed can be achieved only within the same Cluster Fabric. To create a new Cluster Fabric, choose New Input, select a Node pool, then enter the name of the Cluster Fabric to create. To add to an existing Cluster Fabric, choose Existing Input, select a Node pool, then select the previously created Cluster Fabric. |
Table. GPU Node Cluster Fabric selection items
- In the Additional Information Input area, enter or select the required information.
| Category | Required | Detailed description |
|---|---|---|
| Lock | Optional | Using a lock prevents accidental actions that could terminate, start, or stop the server. |
| Init Script | Optional | Script to run when the server starts. The Init Script must be selected according to the image type. For Linux: choose Shell Script or cloud-init. |
| Tag | Optional | Add tags. Up to 50 can be added per resource. After clicking the Add Tag button, enter or select the Key and Value. |
Table. GPU Node additional information input fields
- In the Summary panel, check the detailed information and estimated billing amount, and click the Create button.
- When creation is complete, check the created resources on the GPU Node List page.
- When creating a service, the GPU MIG/ECC settings are reset. However, to ensure the correct settings are applied, perform an initial reboot, verify that the settings have taken effect, and then proceed.
- For detailed information on resetting GPU MIG/ECC settings, refer to the GPU MIG/ECC Reset Settings Checklist Guide.
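The Server name Prefix rules in the creation steps above can be checked before the form is submitted. The sketch below encodes them as a regular expression. The helper names are hypothetical, and the way ‘-###’ is filled in (a zero-padded sequence number starting at 001) is an assumption made for illustration; the Console may number servers differently.

```python
import re

# Rules from the Server name Prefix field: 3 to 11 characters, starts with a
# lowercase letter, uses lowercase letters, digits, and "-", does not end with "-".
_PREFIX = re.compile(r"[a-z][a-z0-9-]{1,9}[a-z0-9]")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the documented naming rules."""
    return _PREFIX.fullmatch(prefix) is not None


def preview_server_names(prefix: str, count: int) -> list:
    """Preview generated names; the numbering scheme is an assumption."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    if count < 2:
        raise ValueError("the minimum number of servers to create is 2")
    # Assumption: "###" is a zero-padded sequence number starting at 001.
    return [f"{prefix}-{i:03d}" for i in range(1, count + 1)]


print(is_valid_prefix("gpu-node"))
print(preview_server_names("gpu-node", 2))
```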
Check GPU Node details
The Multi-node GPU Cluster service allows you to view and edit the full resource list and detailed information of GPU nodes.
The GPU Node Details page consists of the Details, Tags, and Job History tabs.
To view detailed information about the GPU Node, follow these steps.
Click the All Services > Compute > Multi-node GPU Cluster > GPU Node menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
- Resource items beyond the required columns can be added via the Settings button.
| Category | Required | Detailed description |
|---|---|---|
| Resource ID | Optional | User-created GPU Node ID |
| Cluster Fabric name | Required | User-created Cluster Fabric name |
| Server name | Required | User-created GPU Node name |
| Server type | Required | Server type of the GPU Node. Users can view the number of cores, memory capacity, and GPU type and quantity of the resources they created |
| Image | Required | User-created GPU Node image version |
| IP | Required | IP of the GPU Node created by the user |
| Status | Required | Status of the user-created GPU Node |
| Creation date and time | Optional | GPU Node creation timestamp |
Table. GPU Node resource list items
On the GPU Node List page, click the resource for which you want to view detailed information. You will be taken to the GPU Node Details page.
- GPU Server Details At the top of the page, status information and descriptions of additional features are displayed.
Category Detailed description GPU Node status Status of user-created GPU Node - Creating: server is being created
- Running:: creation completed and available
- Editing:: IP is being changed
- Unknown: error state
- Starting: server is starting
- Stopping: server is stopping
- Stopped: server stopped
- Terminating: terminating
- Terminated: termination completed
Server control Button to change server status - Start: Start a stopped server
- Stop: Stop a running server
Service termination Cancel service button Table. GPU Node status information and additional features
Detailed Information
On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit the information if necessary.
| Category | Detailed description |
|---|---|
| Service | Service name |
| Resource Type | Resource type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource name | Resource name |
| Resource ID | Unique resource ID in the service |
| Creator | User who created the service |
| Creation date and time | Date and time the service was created |
| Editor | User who edited the service information |
| Modification date | Date and time the service information was modified |
| Server name | Server name |
| Node pool | A collection of nodes that can be grouped into the same Cluster Fabric |
| Cluster Fabric name | User-created Cluster Fabric name |
| Image/Version | Server OS image and version |
| Server type | CPU, memory, and GPU information |
| Planned Compute | Status of resources with Planned Compute configured |
| Lock | Whether Lock is enabled or disabled |
| Network | GPU Node network information |
| Block Storage | Block Storage information attached to the server |
| Init Script | The Init Script content entered when creating the server |
Tags
On the Tags tab of the GPU Node Details page, you can view the selected resource’s tag information, and add, modify, or delete tags.
| Category | Detailed description |
|---|---|
| Tag list | Tag list |
Job History
On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.
| Category | Detailed description |
|---|---|
| Job history list | Resource change history |
Control GPU Node Operation
If you need server control and management functions for the created GPU Node resources, you can perform tasks on the GPU Node List or GPU Node Details page. You can start and stop the resources of a running GPU node.
Start GPU Node
You can start a GPU Node that is stopped (Stopped). To start a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
- On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can select one or more servers with the checkboxes and start them using the More button at the top.
- On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Start button at the top to start the server.
- Check the server status and complete the status change.
Stop GPU Node
You can stop a GPU Node that is running (Running). To stop a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
- On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can select one or more servers with the checkboxes and stop them using the Stop button at the top.
- On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Stop button at the top to stop the server.
- Check the server status and complete the status change.
Terminate GPU Node
You can terminate unused GPU Nodes to reduce operating costs. However, terminating a service may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with termination.
To terminate a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
- On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
- On the Cluster Fabric List page, select the resources to terminate and click the Service Termination button.
- Resources that use the same Cluster Fabric can be terminated simultaneously.
- When termination is complete, check on the GPU Node List page whether the resources have been terminated.
The cases where a GPU Node cannot be terminated are as follows.
- When Block Storage(BM) is connected: Please disconnect the Block Storage(BM) connection first.
- If File Storage is connected: Please disconnect the File Storage first.
- When Lock is set: Please change the Lock setting to disabled and try again.
- If the selection includes a server that cannot be terminated simultaneously: Please re-select only resources that can be terminated.
- If the server you want to terminate has a different Cluster Fabric: Select only resources that use the same Cluster Fabric.
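The conditions above can be read as a checklist to run before requesting termination. The sketch below models them in plain Python for illustration; the `GpuNode` fields and the `termination_blockers` function are hypothetical names, not part of any Samsung Cloud Platform API.

```python
from dataclasses import dataclass


@dataclass
class GpuNode:
    # Hypothetical fields that mirror the conditions listed above.
    name: str
    cluster_fabric: str
    block_storage_attached: bool = False
    file_storage_attached: bool = False
    locked: bool = False


def termination_blockers(nodes: list) -> list:
    """Return the reasons why the selected nodes cannot be terminated together."""
    reasons = []
    for node in nodes:
        if node.block_storage_attached:
            reasons.append(f"{node.name}: disconnect Block Storage (BM) first")
        if node.file_storage_attached:
            reasons.append(f"{node.name}: disconnect File Storage first")
        if node.locked:
            reasons.append(f"{node.name}: disable Lock first")
    # Only resources that use the same Cluster Fabric can be terminated together.
    if len({node.cluster_fabric for node in nodes}) > 1:
        reasons.append("selection spans more than one Cluster Fabric")
    return reasons


if __name__ == "__main__":
    selection = [
        GpuNode("node-a", "fabric-1", locked=True),
        GpuNode("node-b", "fabric-2"),
    ]
    for reason in termination_blockers(selection):
        print(reason)
```

An empty result means none of the listed conditions applies to the selection.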
2.1 - Manage Cluster Fabric
Cluster Fabric is a service that helps manage the servers (GPU Node) included in a GPU Cluster. By using Cluster Fabric, you can move servers between GPU Clusters in the same Node pool and optimize GPU performance and speed within the same GPU Cluster.
Creating Cluster Fabric
Cluster Fabric can be created together with a GPU Node, and it cannot be created or deleted separately. If all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
If you have not created a GPU Node, please create a GPU Node first. For more information, see Create GPU Node.
Check Cluster Fabric details
- Cluster Fabric can be created together when a GPU node is created, and it cannot be created or deleted independently.
- If all GPU nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
- If you have not created a GPU Node, please create a GPU Node first. For more details, refer to Create GPU Node.
On the Cluster Fabric List page and the Cluster Fabric Details page, you can view the list and details of the Cluster Fabrics that have been created and move servers between them.
Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
- On the Cluster Fabric List page, you can view the resource list of GPU clusters created by the user.
- Resource items beyond the required columns can be added via the Settings button.
| Category | Required status | Detailed description |
|---|---|---|
| Resource ID | Optional | User-created Cluster Fabric ID |
| Cluster Fabric name | Required | User-created Cluster Fabric name |
| Node pool | Optional | A collection of nodes that can be grouped into the same Cluster Fabric |
| Number of servers | Optional | Number of GPU Nodes |
| Server type | Optional | Server type of the GPU Node. Users can view the number of cores, memory capacity, and GPU type and count of the resources they created. |
| Status | Optional | Status of the user-created Cluster Fabric |
| Creation date and time | Optional | Cluster Fabric creation timestamp |

Table. Cluster Fabric resource list items
On the Cluster Fabric List page, click the resource to view detailed information. You will be taken to the Cluster Fabric Details page.
- At the top of the Cluster Fabric Details page, status information and descriptions of additional features are displayed.

| Category | Detailed description |
|---|---|
| Cluster Fabric status | Status of the user-created Cluster Fabric: Creating (the cluster is being created), Active (creation is complete and the cluster is usable), Editing (the IP is being changed), Deleting (termination in progress), Deleted (termination completed) |
| Add target server | Moves a server from another cluster to the target cluster |

Table. Cluster Fabric status information and additional features
Detailed Information
On the Details tab of the Cluster Fabric Details page, you can view detailed information of the selected resource and import servers from another cluster.
| Category | Detailed description |
|---|---|
| Service | Service name |
| Resource Type | Resource type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource name | Resource name |
| Resource ID | Unique resource ID in the service |
| Creator | User who created the service |
| Creation date and time | Date and time the service was created |
| Editor | User who edited the service information |
| Modification date | Date and time the service information was modified |
| Cluster Fabric name | User-created Cluster Fabric name |
| Node pool | A collection of nodes that can be grouped into the same Cluster Fabric |
| Target server | List of GPU Nodes bound to the Cluster Fabric |
Import Cluster Fabric Server
The add target server feature on the Cluster Fabric Details page allows you to import servers from another cluster and add them to the selected cluster.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
- On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
- On the Cluster Fabric List page, click the resource to view detailed information. You will be taken to the Cluster Fabric Details page.
- In the target server of the Details tab, click the Add button on the right.
- The add target server popup window opens.
- Select a cluster from Cluster Fabric.
- GPU nodes associated with the selected cluster are listed; select the GPU node you want to retrieve.
- The selected GPU Node’s name is displayed at the bottom.
- Press the Confirm button to complete.
- Pressing the Cancel button cancels the operation.
- Verify that the added GPU Node is displayed in the target server list.
Terminate Cluster Fabric
If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more information, see Terminate GPU Node.
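The lifecycle rules in this section (a Cluster Fabric exists only alongside its GPU Nodes, servers move only within the same Node pool, and a fabric disappears once all of its nodes are terminated) can be summarized in a toy in-memory model. This is a sketch for reasoning about the rules, not service code; the class and method names are hypothetical, and it assumes a new node may join an existing fabric at creation time.

```python
class ClusterFabricRegistry:
    """Toy model of the Cluster Fabric rules; all names are hypothetical."""

    def __init__(self) -> None:
        # fabric name -> {"node_pool": str, "nodes": set of node names}
        self.fabrics = {}

    def create_node(self, fabric: str, node_pool: str, node: str) -> None:
        # A fabric is never created on its own; it appears with a GPU Node.
        entry = self.fabrics.setdefault(fabric, {"node_pool": node_pool, "nodes": set()})
        if entry["node_pool"] != node_pool:
            raise ValueError("a Cluster Fabric belongs to a single Node pool")
        entry["nodes"].add(node)

    def move_node(self, node: str, source: str, target: str) -> None:
        src, dst = self.fabrics[source], self.fabrics[target]
        if src["node_pool"] != dst["node_pool"]:
            raise ValueError("servers can only move within the same Node pool")
        src["nodes"].remove(node)
        dst["nodes"].add(node)
        # The documentation does not say what happens to a fabric emptied
        # by a move, so this sketch leaves it in place.

    def terminate_node(self, fabric: str, node: str) -> None:
        self.fabrics[fabric]["nodes"].remove(node)
        # When all GPU Nodes are terminated, the fabric is deleted automatically.
        if not self.fabrics[fabric]["nodes"]:
            del self.fabrics[fabric]


if __name__ == "__main__":
    registry = ClusterFabricRegistry()
    registry.create_node("fabric-a", "pool-1", "node-1")
    registry.create_node("fabric-b", "pool-1", "node-2")
    registry.move_node("node-2", "fabric-b", "fabric-a")
    print(sorted(registry.fabrics["fabric-a"]["nodes"]))
```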
2.2 - Install ServiceWatch Agent
Users can install the ServiceWatch Agent on GPU nodes of a Multi-node GPU Cluster to collect custom metrics and logs.
ServiceWatch Agent
In a Multi-node GPU Cluster, the agents that need to be installed on GPU nodes to collect ServiceWatch custom metrics and logs fall into two main types: the Prometheus Exporter and the OpenTelemetry Collector.
| Category | Detailed description |
|---|---|
| Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape |
| OpenTelemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch) |
If you have configured a Kubernetes Engine on a GPU node, please view the GPU metrics using the metrics provided by the Kubernetes Engine.
- If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, it may not operate correctly.
Pre-configuration for Using ServiceWatch Agent
To use the ServiceWatch Agent, please refer to Prerequisite Settings for ServiceWatch Agent and prepare the prerequisite settings.
2.3 - Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster Service Scope
If an IaaS hardware-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, the risks associated with changes such as OS kernel updates or application installations are the user’s responsibility, so technical support is limited; please be mindful when performing tasks such as system updates.
IaaS hardware level issue
- HW fault event messages generated within the server by the IPMI hardware monitoring console.
- GPU HW operation errors observed in the output of the nvidia-smi command
- HW error messages that occur during inspection of InfiniBand HCA cards or InfiniBand Switches
IaaS HW Inspection Guide
After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.
Intel E810 driver update
Check the version of the Intel E810 driver and, if necessary, refer to the following procedure to perform an update.
- Minimum Intel E810 driver version recommended by the server manufacturer: 1.15.4 or later
- Driver download: Intel Network Adapter Driver for 800 Series Devices under Linux
You can use the lspci command to verify whether an E810 NIC device is present.
If an E810 NIC is present, the PCIe device is listed with the E810-C information as shown below. If there is no E810 NIC device, skip this procedure.
lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
</div>
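If this check needs to run on many nodes, the lspci output can be filtered programmatically. The following minimal sketch assumes the output has the same shape as the sample above; the helper name `find_e810_devices` is hypothetical, and the text would normally come from running `lspci` on the node.

```python
import re

# Matches lspci lines such as:
# 0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
E810_LINE = re.compile(r"^(?P<address>\S+)\s+Ethernet controller: .*\bE810\b")


def find_e810_devices(lspci_output: str) -> list:
    """Return the PCI addresses of all E810 Ethernet controllers found."""
    addresses = []
    for line in lspci_output.splitlines():
        match = E810_LINE.match(line.strip())
        if match:
            addresses.append(match.group("address"))
    return addresses


if __name__ == "__main__":
    sample = (
        "0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)\n"
        "0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)\n"
    )
    devices = find_e810_devices(sample)
    print(devices if devices else "No E810 NIC found; the driver update does not apply.")
```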
To update the driver, follow the steps below.
Move the base driver tar file to the desired directory.
/usr/local/src
Directory navigation example
Untar / unzip the archive file.
- x.x.x is the version number of the driver tar file.
tar zxf ice-x.x.x.tar.gz
untar example
Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
cd ice-x.x.x/src/
Example of changing to the src directory
Compile the driver module.
make install
Driver module compilation example
After the update is complete, check the version.
lsmod | grep ice
modinfo ice | grep version
Version check example
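The version reported by the check above can be compared against the recommended minimum of 1.15.4. The sketch below is one way to do that comparison; it assumes the `modinfo ice` output contains a `version:` field, the helper names are hypothetical, and the sample text is made up for illustration.

```python
import re

MINIMUM = (1, 15, 4)  # minimum recommended version stated in this guide


def parse_ice_version(modinfo_output: str):
    """Extract the driver version as a tuple of ints, or None if absent."""
    for line in modinfo_output.splitlines():
        # `grep version` also matches `srcversion:`, so anchor on the field name.
        match = re.match(r"^version:\s*(\d+(?:\.\d+)*)", line.strip())
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


def meets_minimum(version, minimum=MINIMUM) -> bool:
    """Return True when a parsed version is at least the minimum."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    # Hypothetical output used only to exercise the parser.
    sample = "version:        1.9.11\nsrcversion:     0123456789ABCDEF\n"
    version = parse_ice_version(sample)
    print(version, "OK" if meets_minimum(version) else "update recommended")
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "1.9.11" would sort after "1.15.4".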
Check NVIDIA driver
To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS hardware level, follow these steps.
Check the GPU driver status.
~$ nvidia-smi
Example code for checking GPU driver status
Thu Jan 29 14:48:31 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
| N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
| N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
| N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |       Vol |      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM    Unc | CE ENC DEC OFA JPG    |
|                  |                                |       ECC |                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
GPU driver status example
Check the NVSwitch and NVLink hardware status.
Check NVSwitch status
~$ nvidia-smi nvlink --status
NVSwitch status check example
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
NVSwitch status example
Check NVLink hardware status
~$ nvidia-smi topo -m
Example code for checking NVLink hardware status
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127           0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127           0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127           0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127           0-7             N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NVLink HW status check code example
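Reading twelve links for every GPU by eye is error-prone, so the `nvidia-smi nvlink --status` text can be checked with a short script. The sketch below assumes output in the shape shown in the example above, and takes 12 links at 25 GB/s as the expected values only because that is what the A100 sample reports; other GPU models will differ. The function names are hypothetical.

```python
import re

GPU_LINE = re.compile(r"^GPU (\d+):")
LINK_LINE = re.compile(r"^Link (\d+):\s*(.*)$")
SPEED = re.compile(r"^(\d+(?:\.\d+)?) GB/s$")


def parse_nvlink_status(text: str) -> dict:
    """Return {gpu_index: {link_index: speed in GB/s, or None}}."""
    gpus = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        gpu = GPU_LINE.match(line)
        if gpu:
            current = gpus.setdefault(int(gpu.group(1)), {})
            continue
        link = LINK_LINE.match(line)
        if link and current is not None:
            speed = SPEED.match(link.group(2).strip())
            # Anything that is not "<number> GB/s" is treated as not active.
            current[int(link.group(1))] = float(speed.group(1)) if speed else None
    return gpus


def nvlink_problems(gpus: dict, expected_links: int = 12, expected_speed: float = 25.0) -> list:
    """List deviations from the expected link count and per-link speed."""
    problems = []
    for gpu, links in sorted(gpus.items()):
        if len(links) != expected_links:
            problems.append(f"GPU {gpu}: {len(links)} links, expected {expected_links}")
        for index, speed in sorted(links.items()):
            if speed != expected_speed:
                problems.append(f"GPU {gpu} link {index}: {speed}")
    return problems


if __name__ == "__main__":
    # Shortened sample (two links) so the example stays readable.
    sample = (
        "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)\n"
        "    Link 0: 25 GB/s\n"
        "    Link 1: 25 GB/s\n"
    )
    print(nvlink_problems(parse_nvlink_status(sample), expected_links=2))
```

An empty list means every parsed GPU reported the expected number of links at the expected speed.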
Check the InfiniBand (IB) HCA card hardware status and link.
    user@bm-dev-001:~$ ibdev2netdev -v
HW status check command example
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
Example of HW status check result
    root@bm-dev-001:~# ibstat
Example of link verification command
    CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff5060ac
        System image GUID: 0x88e9a4ffff5060ac
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 200
            Base lid: 8
            LMC: 0
            SM lid: 1
            Capability mask: 0x2651e848
            Port GUID: 0x88e9a4ffff5060ac
            Link layer: InfiniBand
    CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504080
        System image GUID: 0x88e9a4ffff504080
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 200
            Base lid: 5
            LMC: 0
            SM lid: 1
            Capability mask: 0x2651e848
            Port GUID: 0x88e9a4ffff504080
            Link layer: InfiniBand
    CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff505038
        System image GUID: 0x88e9a4ffff505038
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 200
            Base lid: 2
            LMC: 0
            SM lid: 1
            Capability mask: 0x2651e848
            Port GUID: 0x88e9a4ffff505038
            Link layer: InfiniBand
    CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504094
        System image GUID: 0x88e9a4ffff504094
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 200
            Base lid: 7
            LMC: 0
            SM lid: 1
            Capability mask: 0x2651e848
            Port GUID: 0x88e9a4ffff504094
            Link layer: InfiniBand
Link verification result example
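The link verification can also be scripted. The following is a minimal Python sketch, not part of the service tooling, that parses ibstat text and lists the adapters whose port is not in the Active / LinkUp state. The function names are illustrative, the sample text is an excerpt of the ibstat result, and the parser assumes one port per CA as in that result.

```python
import re

# Excerpt of the ibstat result (assumption: one port per CA).
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Link layer: InfiniBand
"""

WANTED = ("State", "Physical state", "Rate")


def parse_ibstat(text):
    """Return {ca_name: {field: value}} for the fields listed in WANTED."""
    adapters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"CA '([^']+)'", line)
        if header:
            current = header.group(1)
            adapters[current] = {}
            continue
        key, sep, value = line.partition(":")
        if current and sep and key.strip() in WANTED:
            adapters[current][key.strip()] = value.strip()
    return adapters


def unhealthy_ports(text):
    """List CAs whose port is not Active with a physical state of LinkUp."""
    return sorted(
        name
        for name, info in parse_ibstat(text).items()
        if info.get("State") != "Active" or info.get("Physical state") != "LinkUp"
    )


print(unhealthy_ports(SAMPLE))  # []
```

In practice the text would be the captured output of ibstat on the node; an empty list means every adapter reported Active and LinkUp.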
Check IB bandwidth communication
Follow the steps below to check IB bandwidth between nodes with ib_send_bw and inspect the IaaS hardware level.
Check the name of the IB HCA interface.
- In the following example, the IB ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8 (the ConnectX-7 InfiniBand adapters whose port is ACTIVE).
    ~$ ibdev2netdev -v
Example of checking IB HCA interface name
    0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
    0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
    0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
    0000:4d:00.0 mlx5_3 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp77s0 (Down)
    0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
    0000:9c:00.0 mlx5_5 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
    0000:9d:00.0 mlx5_6 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
    0000:bc:00.0 mlx5_7 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp188s0 (Down)
    0000:cc:00.0 mlx5_8 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
    0000:dc:00.0 mlx5_9 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp220s0 (Down)
Example of IB HCA interface name verification result
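Picking the usable IB ports out of the ibdev2netdev -v output by eye is error-prone on nodes with many adapters. The sketch below shows one possible way to extract them in Python. It assumes, based only on the example output, that an IB port is a device whose port state is ACTIVE and whose network interface name starts with "ib" (which excludes the adapters bound to bond-nas); the function names and this filter are illustrative, not an official rule.

```python
import re

# Four lines taken from the example ibdev2netdev -v output.
SAMPLE = """\
0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
"""

# PCI address, device name, ..., "port N (STATE) ==> netdev (link)".
LINE = re.compile(
    r"^\S+\s+(?P<dev>mlx5_\d+)\s.*\bport\s+(?P<port>\d+)\s+"
    r"\((?P<state>\w+)\s*\)\s+==>\s+(?P<netdev>\S+)\s+\((?P<link>\w+)\)"
)


def parse_ibdev2netdev(text):
    """Return one dict per device line; other lines (e.g. 'cat: ...') are skipped."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def active_ib_ports(text):
    """Devices with an ACTIVE port and an 'ib*' interface name (assumed filter)."""
    return [
        row["dev"]
        for row in parse_ibdev2netdev(text)
        if row["state"] == "ACTIVE" and row["netdev"].startswith("ib")
    ]


print(active_ib_ports(SAMPLE))  # ['mlx5_0', 'mlx5_4']
```

The returned device names are the values that would be passed to the -d option of ib_send_bw in the next step.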
Use the SERVER Side command to check the communication status.
    ~$ ib_send_bw -d mlx5_0 -i 1 -F
SERVER Side command example
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
     remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             0.00               19827.40             0.317238
    ---------------------------------------------------------------------------------------
Example of mutual communication status check result
Use the CLIENT Side command to check the communication status.
    ~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
CLIENT Side command example
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF          Device         : mlx5_0
     Number of qps   : 1            Transport type : IB
     Connection type : RC           Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     RX depth        : 512
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs     : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
     remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             19008.49            19006.37             0.304102
    ---------------------------------------------------------------------------------------
Example of mutual communication status check result
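When the bandwidth test is repeated across many node pairs, it helps to pull the result row out of the ib_send_bw report automatically. The following Python sketch parses the numeric row that follows the "#bytes" header and compares the average bandwidth with a caller-supplied minimum. The function names are illustrative, and the threshold in the example call is a hypothetical value, not a documented service figure.

```python
import re

# Result rows as printed at the end of the CLIENT Side example.
SAMPLE = """\
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37             0.304102
"""

# Five numeric columns: bytes, iterations, BW peak, BW average, message rate.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")


def parse_send_bw(text):
    """Return the first result row as a dict, or None if no row is found."""
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            size, iterations, peak, average, rate = match.groups()
            return {
                "bytes": int(size),
                "iterations": int(iterations),
                "bw_peak": float(peak),
                "bw_avg": float(average),
                "msg_rate": float(rate),
            }
    return None


def bandwidth_ok(text, min_avg_mb_per_sec):
    """True if a result row exists and its average meets the given minimum."""
    result = parse_send_bw(text)
    return result is not None and result["bw_avg"] >= min_avg_mb_per_sec


print(parse_send_bw(SAMPLE)["bw_avg"])  # 19006.37
print(bandwidth_ok(SAMPLE, 18000.0))    # True (18000.0 is a hypothetical threshold)
```

The comparison uses the average column rather than the peak because the SERVER Side report in the example shows a peak of 0.00.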
Check IB service related kernel modules
Check the relevant kernel modules for the IB service (lsmod) to inspect the IaaS hardware level.
Check IB service-related kernel module - nvidia_peermem
    ~$ lsmod | grep nvidia_peermem
Example command to check nvidia_peermem
    nvidia_peermem         16384  0
    ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
Example of nvidia_peermem check result
Check IB service related kernel modules - IB Card (HCA) driver
    ~$ lsmod|egrep 'ib_|_ib|mlx'
Example command to check the IB Card (HCA) driver
    mlx5_ib               393216  0
    ib_uverbs             163840  2 irdma,mlx5_ib
    ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
    mlx5_core            1593344  1 mlx5_ib
    mlxfw                  32768  1 mlx5_core
    psample                20480  1 mlx5_core
    tls                   114688  1 mlx5_core
    pci_hyperv_intf        16384  1 mlx5_core
    ib_ipoib              139264  0
    ib_cm                 131072  2 rdma_cm,ib_ipoib
    ib_umad                40960  0
    mlx5_ib               454656  0
    ib_uverbs             135168  2 rdma_ucm,mlx5_ib
    ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
    mlx5_core            2064384  1 mlx5_ib
    mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Example of IB Card (HCA) driver verification result
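A plain grep on lsmod output also matches module names that appear only in the "used by" column, as the ib_core line in the nvidia_peermem result shows. The Python sketch below checks the first column only. The function names are illustrative, and the list of required modules passed in the example is an assumption drawn from the example output, not an official list.

```python
# The nvidia_peermem check result from the example.
SAMPLE = """\
nvidia_peermem         16384  0
ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
"""


def loaded_modules(text):
    """Return the set of module names found in the first column of lsmod output."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header of unfiltered output.
        if fields and fields[0] != "Module":
            names.add(fields[0])
    return names


def missing_modules(text, required):
    """Return the required module names that are not loaded, in the given order."""
    loaded = loaded_modules(text)
    return [name for name in required if name not in loaded]


# Assumed requirement list for illustration only.
print(missing_modules(SAMPLE, ["nvidia_peermem", "ib_core"]))  # []
```

For a complete check, the input should be the unfiltered lsmod output, since a grep-filtered listing hides modules that do not match the pattern.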
Check storage physical disk resources and Multi-Path
Verify the storage physical disk resources and Multi-Path to assess the IaaS hardware level.
Storage Physical Disk Resource Check Results
    root@bm-dev-002:/tmp# fdisk -l
Example of storage physical disk resource check result
Multi-Path verification result
    root@bm-dev-002:/tmp# multipath -ll
Example of Multi-Path verification result
Check Service Network after new deployment of Multi-node GPU Cluster
Use the following command to verify that the MII Status of the bond and of each Slave Interface is up.
Service Network check command
    ~$ cat /proc/net/bonding/bond-srv
Example command to check Service Network
Service Network check result
    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
Example of Service Network verification result
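The MII Status check can be automated by reading the bonding status text. The sketch below is one possible Python approach: the first MII Status line is treated as the bond's own status, and each MII Status line after a Slave Interface line is attributed to that interface. The function names are illustrative, and the sample is an excerpt of the Service Network check result.

```python
# Excerpt of /proc/net/bonding/bond-srv from the example.
SAMPLE = """\
Ethernet Channel Bonding Driver: v5.15.0-25-generic
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Slave Interface: ens9f0
MII Status: up
Permanent HW addr: 30:3e:a7:02:35:70
Slave Interface: ens11f0
MII Status: up
Permanent HW addr: 30:3e:a7:02:2f:e8
"""


def parse_bonding(text):
    """Return (bond_status, {slave_name: status}) from bonding status text."""
    bond_status = None
    slaves = {}
    current = None
    for raw in text.splitlines():
        # Split at the first colon so MAC addresses stay in the value part.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status":
            if current is None:
                bond_status = value
            else:
                slaves[current] = value
    return bond_status, slaves


def all_up(text):
    """True only if the bond and every listed slave interface report 'up'."""
    bond_status, slaves = parse_bonding(text)
    return (
        bond_status == "up"
        and bool(slaves)
        and all(status == "up" for status in slaves.values())
    )


print(all_up(SAMPLE))  # True
```

On a node, the text could be read from /proc/net/bonding/bond-srv, the path used in the example command.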
Multi-node GPU Cluster: Verify time synchronization with the Time Server after new deployment
The OS image comes with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that a line marked with ^* appears in the MS column of the output; ^* marks the source to which chrony is currently synchronized.
Command to check the chrony daemon source and synchronization status
    ~$ chronyc sources -V
Example of status check command
Result of checking chrony daemon status
    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
Example of checking the chrony daemon status
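The ^* check lends itself to a small script for post-deployment validation. The following Python sketch scans chronyc sources text for the line that begins with ^* and returns the address of that source. The function name is illustrative, and the sample is the result from the example.

```python
# The chronyc sources result from the example.
SAMPLE = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
"""


def synced_source(text):
    """Return the name or IP of the source marked ^*, or None if there is none."""
    for raw in text.splitlines():
        line = raw.lstrip()
        if line.startswith("^*"):
            fields = line.split()
            # fields[0] is the "^*" marker, fields[1] the source address.
            return fields[1] if len(fields) > 1 else None
    return None


print(synced_source(SAMPLE))  # 198.19.0.54
```

A return value of None means no source is currently selected for synchronization, which is the condition the manual check is meant to catch.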
3 - Release Note
Multi-node GPU Cluster
- The Ubuntu image version 24.04 has been added.
- The B300 GPU server type has been added.
- Provides an IaC environment using Terraform.
- You can terminate multiple resources simultaneously from the GPU Node list.
- The nodes must use the same DataSet and Cluster Fabric.
- Integrated with Cloud Monitoring.
- You can view key performance metrics in real time with Cloud Monitoring.
- We have launched the Multi-node GPU Cluster service.
- We provide a service that offers physical GPU servers without virtualization for large-scale, high-performance AI computing.