Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster Service Scope
If an IaaS hardware-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. Changes such as OS kernel updates or application installations are made at the user's own risk, and technical support for problems they cause is limited, so take care when performing tasks such as system updates.
IaaS hardware-level issues
- HW fault event messages generated within the server by the IPMI hardware monitoring console
- GPU HW operation errors reported by the nvidia-smi command
- HW error messages that occur during inspection of InfiniBand HCA cards or InfiniBand switches
IaaS HW Inspection Guide
After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.
Intel E810 driver update
Check the version of the Intel E810 driver and, if necessary, refer to the following procedure to perform an update.
- Server manufacturer Intel E810 driver minimum recommended version: 1.15.4 or later
- Driver download: Intel Network Adapter Driver for 800 Series Devices under Linux
You can use the lspci command to verify whether an E810 NIC device is present.
If an E810 NIC is installed, it appears as a PCIe device identified by the string E810-C, as in the following output. If there is no E810 NIC device, skip this procedure.
lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
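The presence check above can also be scripted. The following is a minimal sketch, not part of the service tooling: it parses text in the format shown in the lspci example and returns the PCI addresses of E810 devices. The function name find_e810_devices and the SAMPLE_LSPCI constant are hypothetical names introduced for illustration.

```python
import re

# Sample text copied from the lspci example in this guide; on a real node it
# would come from running `lspci | grep E810`.
SAMPLE_LSPCI = """\
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
"""

# PCI address: optional 4-hex-digit domain, then bus:device.function.
PCI_LINE = re.compile(
    r"^((?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$"
)


def find_e810_devices(lspci_text):
    """Return the PCI addresses of lines that describe an E810 controller."""
    addresses = []
    for line in lspci_text.splitlines():
        match = PCI_LINE.match(line.strip())
        if match and "E810" in match.group(2):
            addresses.append(match.group(1))
    return addresses


devices = find_e810_devices(SAMPLE_LSPCI)
if devices:
    print("E810 devices found:", ", ".join(devices))
else:
    print("No E810 NIC found; the driver update does not apply.")
```

An empty result corresponds to the case where the driver update procedure is skipped.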
To update the driver, follow the steps below.
Move the base driver tar file to the desired directory.
/usr/local/src
Directory navigation example
Extract the archive file.
- x.x.x is the version number of the driver tar file.
tar zxf ice-x.x.x.tar.gz
Untar example
Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
cd ice-x.x.x/src/
Example of changing to the src directory
Compile and install the driver module.
make install
Driver module compilation example
After the update is complete, check the version.
lsmod | grep ice
modinfo ice | grep version
Version check example
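Comparing the reported version against the recommended minimum (1.15.4) is easy to get wrong when done as a string comparison, because "1.9.11" sorts after "1.15.4" as text. The sketch below compares versions numerically. The guide does not show the modinfo output, so SAMPLE_MODINFO is a hypothetical illustration of its usual "version:" line, and the helper names are assumptions.

```python
import re

MIN_ICE_VERSION = (1, 15, 4)  # minimum recommended version stated in this guide

# Hypothetical output of `modinfo ice | grep version` (not taken from the guide).
SAMPLE_MODINFO = """\
version:        1.15.4
srcversion:     0123456789ABCDEF0123456
"""


def parse_driver_version(modinfo_text):
    """Return the driver version as a tuple of ints, or None if absent."""
    for line in modinfo_text.splitlines():
        # Anchored at the start so that "srcversion:" is not picked up.
        match = re.match(r"version:\s*(\d+(?:\.\d+)*)", line.strip())
        if match:
            return tuple(int(part) for part in match.group(1).split("."))
    return None


def meets_minimum(version, minimum=MIN_ICE_VERSION):
    """Compare numerically, component by component."""
    return version is not None and version >= minimum


version = parse_driver_version(SAMPLE_MODINFO)
print("driver version:", version, "meets minimum:", meets_minimum(version))
```

A result of None means that no version line was found, which should be treated as "needs checking" rather than as a pass.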
Check NVIDIA driver
To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS hardware level, follow these steps.
Check the GPU driver status.
~$ nvidia-smi
Example code for checking GPU driver status
Thu Jan 29 14:48:31 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
| N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
| N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
| N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |       Vol |      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM    Unc | CE ENC DEC OFA JPG    |
|                  |                                |       ECC |                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
GPU driver status example
Check the NVSwitch and NVLink hardware status.
Check NVSwitch status
~$ nvidia-smi nvlink --status
NVSwitch status check example
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
NVSwitch status example
Check NVLinks hardware status
~$ nvidia-smi topo -m
Example code for checking NVLink hardware status
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127           0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127           0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127           0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127           0-7             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NVLink HW status check code example
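In the topology example above, every GPU pair is connected as NV12. As a rough sketch of how that check could be automated, the code below parses text in the nvidia-smi topo -m format and lists GPU pairs whose link type differs from an expected value. SAMPLE_TOPO is a shortened, hypothetical three-GPU version of the example, and the function names and the default expected value "NV12" are assumptions for illustration.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")

# Shortened, hypothetical 3-GPU version of the topology example in this guide.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    0-127           0-7             N/A
GPU1    NV12     X      NV12    0-127           0-7             N/A
GPU2    NV12    NV12     X      0-127           0-7             N/A

Legend:

  X    = Self
  NV#  = Connection traversing a bonded set of # NVLinks
"""


def parse_topo_matrix(topo_text):
    """Return {(row_gpu, column_gpu): link_type} for the GPU-to-GPU cells."""
    columns = []
    links = {}
    for line in topo_text.splitlines():
        tokens = line.split()
        if not tokens or not GPU_NAME.match(tokens[0]):
            continue  # blank lines, legend lines, non-GPU rows
        if not columns:
            # First matching line is the header: collect the leading GPU names.
            for token in tokens:
                if not GPU_NAME.match(token):
                    break
                columns.append(token)
            continue
        row = tokens[0]
        for column, link in zip(columns, tokens[1:]):
            links[(row, column)] = link
    return links


def unexpected_links(links, expected="NV12"):
    """Return sorted GPU pairs (excluding self) whose link is not `expected`."""
    return sorted(
        pair for pair, link in links.items()
        if pair[0] != pair[1] and link != expected
    )


print(unexpected_links(parse_topo_matrix(SAMPLE_TOPO)))
```

An empty list means that every GPU pair reports the expected NVLink connection; any pair listed (for example one showing SYS or PHB) is a candidate for a hardware-level inquiry.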
Check the InfiniBand (IB) HCA card hardware status and link.
user@bm-dev-001:~$ ibdev2netdev -v
HW status check command example
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$
Example of HW status check result
root@bm-dev-001:~# ibstat
Example of link verification command
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff5060ac
    System image GUID: 0x88e9a4ffff5060ac
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 8
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff5060ac
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504080
    System image GUID: 0x88e9a4ffff504080
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 5
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504080
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff505038
    System image GUID: 0x88e9a4ffff505038
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 2
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff505038
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504094
    System image GUID: 0x88e9a4ffff504094
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 7
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504094
        Link layer: InfiniBand
Link verification result example
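In the ibstat example, a healthy HCA port shows State: Active, Physical state: LinkUp, and Rate: 200. The following sketch, offered under the assumption that each card has a single port as in the example, parses ibstat text and lists cards that deviate from those values. The function names and the expected_rate default are illustrative choices, not part of any InfiniBand tooling.

```python
# One stanza copied from the ibstat example in this guide.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff5060ac
    System image GUID: 0x88e9a4ffff5060ac
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 8
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff5060ac
        Link layer: InfiniBand
"""


def parse_ibstat(ibstat_text):
    """Return {ca_name: {field: value}}; assumes one port per CA."""
    cards = {}
    current = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        if line.startswith("CA '") and line.endswith("'"):
            current = line[4:-1]  # text between the quotes
            cards[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            cards[current][key.strip()] = value.strip()
    return cards


def unhealthy_cards(cards, expected_rate="200"):
    """Return names of CAs whose port is not Active/LinkUp at the expected rate."""
    return [
        name for name, fields in cards.items()
        if fields.get("State") != "Active"
        or fields.get("Physical state") != "LinkUp"
        or fields.get("Rate") != expected_rate
    ]


print(unhealthy_cards(parse_ibstat(SAMPLE_IBSTAT)))
```

For cards with more than one port, the later port's fields would overwrite the earlier ones in this sketch, so it would need a per-port structure before being used on such hardware.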
Check IB bandwidth communication
Check the IB bandwidth communication status (ib_send_bw) and follow the steps below to inspect the IaaS hardware level.
Check the name of the IB HCA interface.
- In the following example, the IB ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8.
~$ ibdev2netdev -v
Example of checking IB HCA interface name
0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
0000:4d:00.0 mlx5_3 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp77s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
0000:9c:00.0 mlx5_5 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
0000:9d:00.0 mlx5_6 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:bc:00.0 mlx5_7 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp188s0 (Down)
0000:cc:00.0 mlx5_8 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
0000:dc:00.0 mlx5_9 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp220s0 (Down)
Example of IB HCA interface name verification result
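The IB ports named in the example (mlx5_0, mlx5_4, mlx5_5, mlx5_8) are the devices whose port is ACTIVE and whose network interface name starts with "ib"; the ACTIVE devices mapped to bond-nas are not in the list. The sketch below reproduces that selection under the assumption that this is the intended rule, which the guide does not state explicitly. The regular expression and function name are illustrative.

```python
import re

# Four lines copied from the ibdev2netdev -v example in this guide.
SAMPLE_IBDEV2NETDEV = """\
0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
"""

# device name, port number, port state, mapped net device, net device link state
DEVICE_LINE = re.compile(
    r"^\S+\s+(?P<dev>mlx5_\d+)\s.*\bport\s+(?P<port>\d+)\s+"
    r"\((?P<state>\w+)\s*\)\s+==>\s+(?P<netdev>\S+)\s+\((?P<link>\w+)\)"
)


def active_ib_devices(text):
    """Return device names with an ACTIVE port mapped to an ib* interface."""
    devices = []
    for line in text.splitlines():
        match = DEVICE_LINE.match(line.strip())
        if not match:
            continue  # e.g. "cat: ... Permission denied" lines
        if match.group("state") == "ACTIVE" and match.group("netdev").startswith("ib"):
            devices.append(match.group("dev"))
    return devices


print(active_ib_devices(SAMPLE_IBDEV2NETDEV))
```

The resulting names are the values that would be passed to the -d option of ib_send_bw in the next step.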
Use the SERVER Side command to check the communication status.
~$ ib_send_bw -d mlx5_0 -i 1 -F
SERVER Side command example
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
 remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40            0.317238
---------------------------------------------------------------------------------------
Example of mutual communication status check result
Use the CLIENT Side command to check the communication status.
~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
CLIENT Side command example
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
 remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37            0.304102
---------------------------------------------------------------------------------------
Example of mutual communication status check result
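To compare a measured bandwidth with a link rate quoted in Gb/s, the result row has to be extracted and converted. The sketch below does this for the result table format shown above. It assumes that the tool's "MB" means 2^20 bytes; that unit is an assumption exposed as the bytes_per_mb parameter and should be confirmed against the perftest version in use. Function names are illustrative.

```python
# Result table copied from the CLIENT Side example in this guide.
SAMPLE_RESULT = """\
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37            0.304102
"""


def parse_bw_result(perftest_text):
    """Return the first result row after the '#bytes' header, or None."""
    lines = perftest_text.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith("#bytes") and index + 1 < len(lines):
            fields = lines[index + 1].split()
            if len(fields) >= 5:
                return {
                    "bytes": int(fields[0]),
                    "iterations": int(fields[1]),
                    "bw_peak": float(fields[2]),
                    "bw_average": float(fields[3]),
                    "msg_rate": float(fields[4]),
                }
    return None


def to_gbits(mb_per_sec, bytes_per_mb=2**20):
    """Convert MB/sec to Gb/s. bytes_per_mb=2**20 is an assumption."""
    return mb_per_sec * bytes_per_mb * 8 / 1e9


result = parse_bw_result(SAMPLE_RESULT)
print("average: %.2f Gb/s" % to_gbits(result["bw_average"]))
```

If the measured average falls well below the port rate reported for the device, that is a reasonable point at which to re-check cabling and link state before raising a hardware inquiry.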
Check IB service related kernel modules
Check the relevant kernel modules for the IB service (lsmod) to inspect the IaaS hardware level.
Check IB service-related kernel module - nvidia_peermem
~$ lsmod | grep nvidia_peermem
Example command to check nvidia_peermem
nvidia_peermem         16384  0
ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
Example of nvidia_peermem check result
Check IB service related kernel modules - IB Card (HCA) driver
~$ lsmod | egrep 'ib_|_ib|mlx'
Example command to check the IB Card (HCA) driver
mlx5_ib               393216  0
ib_uverbs             163840  2 irdma,mlx5_ib
ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
mlx5_core            1593344  1 mlx5_ib
mlxfw                  32768  1 mlx5_core
psample                20480  1 mlx5_core
tls                   114688  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core
ib_ipoib              139264  0
ib_cm                 131072  2 rdma_cm,ib_ipoib
ib_umad                40960  0
mlx5_ib               454656  0
ib_uverbs             135168  2 rdma_ucm,mlx5_ib
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
mlx5_core            2064384  1 mlx5_ib
mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Example of IB Card (HCA) driver verification result
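Because grep also matches module names that appear only in the "Used by" column, a stricter check looks at the first column of the lsmod output. The following sketch does that for a list of modules. The REQUIRED_MODULES tuple is an assumption drawn from the modules visible in the examples above, not an official requirement list, and the function names are illustrative.

```python
# Assumed list, taken from modules that appear in the examples in this guide.
REQUIRED_MODULES = ("nvidia_peermem", "ib_core", "mlx5_core", "mlx5_ib")

# Lines copied from the lsmod examples in this guide.
SAMPLE_LSMOD = """\
nvidia_peermem         16384  0
ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_ib               454656  0
mlx5_core            2064384  1 mlx5_ib
"""


def loaded_modules(lsmod_text):
    """Return the set of module names from the first column of lsmod output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        if fields and fields[0] != "Module":  # skip the lsmod header line
            names.add(fields[0])
    return names


def missing_modules(lsmod_text, required=REQUIRED_MODULES):
    """Return required modules that are not loaded, in the order given."""
    loaded = loaded_modules(lsmod_text)
    return [name for name in required if name not in loaded]


print(missing_modules(SAMPLE_LSMOD))
```

An empty list means that every module in the assumed list is loaded.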
Check storage physical disk resources and Multi-Path
Verify the storage physical disk resources and Multi-Path to assess the IaaS hardware level.
Storage physical disk resource check
root@bm-dev-002:/tmp# fdisk -l
Example of storage physical disk resource check command
Multi-Path verification
root@bm-dev-002:/tmp# multipath -ll
Example of Multi-Path verification command
Check Service Network after new deployment of Multi-node GPU Cluster
Use the following command to verify that the MII Status is up for both the bond and each Slave Interface.
Service Network check command
~$ cat /proc/net/bonding/bond-srv
Example command to check Service Network
Service Network check result
Ethernet Channel Bonding Driver: v5.15.0-25-generic

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
Example of Service Network verification result
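The bonding status file contains one MII Status line for the bond itself and one for each slave interface. The sketch below separates them so that each can be checked on its own. It works on text in the format shown above; the function name is an illustrative choice.

```python
# Text copied from the Service Network example in this guide.
SAMPLE_BONDING = """\
Ethernet Channel Bonding Driver: v5.15.0-25-generic

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Link Failure Count: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Link Failure Count: 0
"""


def parse_bonding(text):
    """Return (bond_mii_status, {slave_name: mii_status})."""
    bond_status = None
    slaves = {}
    current_slave = None
    for raw in text.splitlines():
        key, separator, value = raw.partition(":")
        if not separator:
            continue  # blank lines
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current_slave = value
        elif key == "MII Status":
            # Before the first "Slave Interface" line, the status is the bond's.
            if current_slave is None:
                bond_status = value
            else:
                slaves[current_slave] = value
    return bond_status, slaves


bond_status, slaves = parse_bonding(SAMPLE_BONDING)
all_up = bond_status == "up" and bool(slaves) and all(s == "up" for s in slaves.values())
print("bond:", bond_status, "slaves:", slaves, "all up:", all_up)
```

On a node, the input text would be the contents of /proc/net/bonding/bond-srv.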
Check time synchronization with the Time Server after new deployment of Multi-node GPU Cluster
The OS image includes the installation of the chrony daemon and configuration for SCP NTP server synchronization. Use the following command to verify whether a line marked with ^* exists in the MS Name column.
Command to check the chrony daemon source and synchronization status
~$ chronyc sources -v
Example of status check command
Result of checking chrony daemon status
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
Example of checking the chrony daemon status
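The check for a line marked ^* can be expressed as a small parser. The sketch below returns the address of the source that chronyd is currently synchronized to, or None if no such line exists. It operates on text in the format of the example above; the function name is illustrative.

```python
# Text copied from the chrony example in this guide.
SAMPLE_SOURCES = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
"""


def synced_source(chronyc_text):
    """Return the name or address on the line marked '^*', or None."""
    for line in chronyc_text.splitlines():
        if line.startswith("^*"):
            fields = line.split()
            if len(fields) > 1:
                return fields[1]
    return None


source = synced_source(SAMPLE_SOURCES)
print("synchronized to:", source)
```

A return value of None corresponds to the case where no source is selected and time synchronization should be investigated.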