The page has been translated by Gen AI.

Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster Service Scope

If an IaaS hardware-level issue occurs with the Multi-node GPU Cluster service, you can request technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installations are the user's responsibility, and technical support for them is limited. Take care when performing tasks such as system updates.

IaaS hardware-level issues

  • HW fault event messages generated within the server and reported by the IPMI hardware monitoring console
  • GPU HW operation errors observed in the output of the nvidia-smi command
  • HW error messages that occur during inspection of InfiniBand HCA cards or InfiniBand switches
Caution
Because the Multi-node GPU Cluster service is sensitive to software version compatibility among Ubuntu OS, NVIDIA, and InfiniBand components, official technical support is unavailable after user changes such as an OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, we recommend checking the IaaS hardware according to this inspection guide.

Intel E810 driver update

Check the version of the Intel E810 driver and, if necessary, refer to the following procedure to perform an update.

Caution
Proceed with the update only if the Multi-node GPU Cluster node has an Intel E810 device and also uses a standard image of version 535.86.10 or lower.
Reference

You can use the lspci command to check whether an E810 NIC is present. If one is installed, its PCIe devices are listed with the E810-C identifier, as in the following example. If no E810 NIC is present, skip this update.

lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
Example of checking for the presence of an E810 NIC device
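
The two conditions in the caution above can also be checked in a script before starting the update. The following is a minimal sketch, not an official tool: the function names are hypothetical, the version comparison assumes GNU `sort` with the `-V` option, and the node's standard image version must be supplied by the operator because this guide does not state where to read it from.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the service tooling.

# Count E810 devices in `lspci` output read from stdin.
count_e810() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status zero when the count is 0.
    grep -c 'E810' || true
}

# Succeed if version $1 is lower than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed only when both conditions from the caution hold.
# $1: number of E810 devices, $2: standard image version of the node.
needs_e810_update() {
    [ "$1" -gt 0 ] && version_le "$2" "535.86.10"
}

# Usage on a node (IMAGE_VERSION must be supplied by the operator):
#   if needs_e810_update "$(lspci | count_e810)" "$IMAGE_VERSION"; then
#       echo "E810 driver update applies to this node"
#   fi
```

Reading the `lspci` output from standard input keeps the helper usable on saved output as well as on a live node.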

To update the driver, follow the steps below.

  1. Move the base driver tar file to the desired directory.

    /usr/local/src
    Directory navigation example

  2. Untar (extract) the archive file.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      untar example
  3. Change to the src directory of the driver.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Example of changing to the src directory
  4. Compile and install the driver module.

    make install
    Driver module compilation example

  5. After the update is complete, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Version check example
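
When reading the version in step 5, note that `modinfo` reports the module file on disk, while the module already loaded in the kernel may still be the previous one until it is reloaded or the node is rebooted. The sketch below compares the two. It is an illustration under assumptions: the helper names are hypothetical, and it assumes the loaded ice module exposes its version at `/sys/module/ice/version`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the value of the "version:" field from `modinfo` output on stdin.
# Matching the whole first field skips "srcversion:".
modinfo_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Compare the on-disk version ($1) with the loaded version ($2).
report_ice_versions() {
    if [ "$1" = "$2" ]; then
        echo "ice $1 is installed and loaded"
    else
        echo "ice $1 is installed, but $2 is loaded"
    fi
}

# Usage on a node (assumes /sys/module/ice/version is exposed):
#   report_ice_versions "$(modinfo ice | modinfo_version)" \
#                       "$(cat /sys/module/ice/version)"
```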

Check NVIDIA driver

To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS hardware level, follow these steps.

Reference
The example with MIG applied is based on an A100 GPU node.
  1. Check the GPU driver status.

    ~$ nvidia-smi
    Example code for checking GPU driver status
    Thu Jan 29 14:48:31 2026
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
    | N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
    | N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
    | N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
    | N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
    | N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
    | N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  No MIG devices found                                                                 |
    +---------------------------------------------------------------------------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    GPU driver status example

  2. Check the NVSwitch and NVLink hardware status.

    • Check NVSwitch status

      ~$ nvidia-smi nvlink --status
      NVSwitch status check example
      GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      NVSwitch status example

    • Check NVLink hardware status

      ~$ nvidia-smi topo -m
      Example code for checking NVLink hardware status
              GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
      GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
      GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
      GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
      GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A
      
      Legend:
      
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks
      NVLink HW status check result example

  3. Check the InfiniBand (IB) HCA card hardware status and link.

    user@bm-dev-001:~$ ibdev2netdev -v
    HW status check command example
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    Example of HW status check result
    root@bm-dev-001:~# ibstat
    Example of link verification command
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Link verification result example
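
The NVLink and InfiniBand checks in steps 2 and 3 lend themselves to scripting when many nodes must be inspected. The sketch below is illustrative, not an official tool: the function names are hypothetical, and the parsing assumes the output layouts shown in the examples above. In particular, it assumes that an NVLink link that is not up does not report a `GB/s` figure.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and the parsing assumes the
# output layouts shown in this guide.

# Read `nvidia-smi nvlink --status` output from stdin and print
# "GPU <n> Link <m>" for every link line that does not end in "GB/s".
nvlink_problem_links() {
    awk '
        /^[ \t]*GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu) }
        /^[ \t]*Link [0-9]+:/ {
            link = $2; sub(/:$/, "", link)
            if ($0 !~ /GB\/s[ \t]*$/) print "GPU " gpu " Link " link
        }
    '
}

# Read `ibstat` output from stdin and print one line for every port whose
# State is not "Active" or whose Physical state is not "LinkUp".
ib_problem_ports() {
    awk '
        # "CA <quoted name>" has two fields; "CA type: ..." has three.
        $1 == "CA" && NF == 2 { ca = substr($2, 2, length($2) - 2) }
        $1 == "State:" { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            if (state != "Active" || $3 != "LinkUp")
                print ca " State=" state " Physical=" $3
        }
    '
}

# Usage on a node (ibstat is run as root in the example above):
#   nvidia-smi nvlink --status | nvlink_problem_links
#   ibstat | ib_problem_ports
```

Both helpers print nothing when every link and port matches the healthy state, so any output is a line worth investigating.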

Check IB bandwidth communication

To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS hardware level, follow the steps below.

  1. Check the name of the IB HCA interface.

    • In the following example, the IB ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8.
      ~$ ibdev2netdev -v
      Example of checking IB HCA interface name
      0000:1a:00.0 mlx5_0 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
      0000:1b:00.0 mlx5_1 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:3c:00.0 mlx5_2 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp60s0 (Down)
      0000:4d:00.0 mlx5_3 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp77s0 (Down)
      0000:5e:00.0 mlx5_4 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
      0000:9c:00.0 mlx5_5 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
      0000:9d:00.0 mlx5_6 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:bc:00.0 mlx5_7 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp188s0 (Down)
      0000:cc:00.0 mlx5_8 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
      0000:dc:00.0 mlx5_9 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp220s0 (Down)
      Example of IB HCA interface name verification result
  2. Run the following command on the SERVER side to check the communication status. The server waits until the client connects.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F
    SERVER Side command example
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex.  method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
     remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             0.00               19827.40                   0.317238
    ---------------------------------------------------------------------------------------
    Example of mutual communication status check result

  3. Run the following command on the CLIENT side, specifying the IP address of the SERVER side, to check the communication status.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
    CLIENT Side command example
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     RX depth        : 512
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex.  method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
     remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             19008.49            19006.37                  0.304102
    ---------------------------------------------------------------------------------------
    Example of mutual communication status check result
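
The port selection and bandwidth check in the steps above can also be scripted. The following is a minimal sketch that parses text in the format of the example outputs shown above. The function names active_ib_ports and bw_average, and the rule of filtering on the word "Infiniband" in the adapter description, are illustrative assumptions and not part of the service tooling. This guide does not state an expected bandwidth, so any pass/fail threshold must be chosen by the operator.

```python
import re
from typing import List

# One device row of `ibdev2netdev -v`, in the format shown in the example, e.g.
# 0000:1a:00.0 mlx5_0 (MT4129 - ...) Nvidia ConnectX-7 ... fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
DEVICE_ROW = re.compile(
    r"^\s*\S+\s+(?P<dev>mlx5_\d+)\s+\([^)]*\)\s+(?P<desc>.*?)\s+fw\s+\S+"
    r"\s+port\s+\d+\s+\((?P<state>[A-Z]+)\s*\)\s+==>\s+(?P<netdev>\S+)"
)


def active_ib_ports(ibdev2netdev_output: str) -> List[str]:
    """Return the device names of InfiniBand adapters whose port state is ACTIVE.

    Assumption: InfiniBand-only adapters are recognized by the word
    "Infiniband" in the description column, as in the example output.
    """
    ports = []
    for line in ibdev2netdev_output.splitlines():
        match = DEVICE_ROW.match(line)
        if match is None:
            continue  # skip captions, blank lines, and anything unexpected
        if match.group("state") == "ACTIVE" and "Infiniband" in match.group("desc"):
            ports.append(match.group("dev"))
    return ports


def bw_average(ib_send_bw_output: str) -> float:
    """Return the 'BW average[MB/sec]' value from an ib_send_bw report."""
    lines = ib_send_bw_output.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith("#bytes"):
            # The result row follows the header and has five numeric columns.
            for row in lines[index + 1:]:
                fields = row.split()
                if len(fields) == 5 and fields[0].isdigit():
                    return float(fields[3])
    raise ValueError("no ib_send_bw result row found")
```

On a node, the input text could be captured with, for example, subprocess.run(["ibdev2netdev", "-v"], capture_output=True, text=True).stdout, and the value returned by bw_average compared against a threshold chosen by the operator.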

Check IB service related kernel modules

Use lsmod to confirm that the kernel modules required by the IB service are loaded, as part of the IaaS hardware-level inspection.

  • Check IB service-related kernel module - nvidia_peermem

    ~$ lsmod | grep nvidia_peermem
    Example command to check nvidia_peermem
    nvidia_peermem         16384  0
    ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
    Example of nvidia_peermem check result

  • Check IB service related kernel modules - IB Card (HCA) driver

    ~$ lsmod|egrep 'ib_|_ib|mlx'
    Example command to check the IB Card (HCA) driver
    mlx5_ib               393216  0
    ib_uverbs             163840  2 irdma,mlx5_ib
    ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
    mlx5_core            1593344  1 mlx5_ib
    mlxfw                  32768  1 mlx5_core
    psample                20480  1 mlx5_core
    tls                   114688  1 mlx5_core
    pci_hyperv_intf        16384  1 mlx5_core
    ib_ipoib              139264  0
    ib_cm                 131072  2 rdma_cm,ib_ipoib
    ib_umad                40960  0
    mlx5_ib               454656  0
    ib_uverbs             135168  2 rdma_ucm,mlx5_ib
    ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
    mlx5_core            2064384  1 mlx5_ib
    mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
    Example of IB Card (HCA) driver verification result
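
The module checks above can be combined into one pass over the lsmod output. The following is a minimal sketch. The function name missing_modules is hypothetical, and the list of required modules is an assumption drawn from the module names that appear in the example outputs; this guide does not define an official list.

```python
from typing import Iterable, List

# Assumption: module names taken from the example outputs in this guide.
ASSUMED_REQUIRED = ["nvidia_peermem", "mlx5_core", "mlx5_ib", "ib_core", "ib_uverbs"]


def missing_modules(lsmod_output: str, required: Iterable[str] = ASSUMED_REQUIRED) -> List[str]:
    """Return the required module names that do not appear in the lsmod output."""
    loaded = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        if fields:
            # The first column of each lsmod row is the module name.
            loaded.add(fields[0])
    return [name for name in required if name not in loaded]
```

An empty list means every module in the assumed list is loaded. On a node, the input could be captured with subprocess.run(["lsmod"], capture_output=True, text=True).stdout.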

Check storage physical disk resources and Multi-Path

Verify the storage physical disk resources and Multi-Path to assess the IaaS hardware level.

  • Storage Physical Disk Resource Check Results

    root@bm-dev-002:/tmp# fdisk -l
    Example of storage physical disk resource check result

  • Multi-Path verification result

    root@bm-dev-002:/tmp# multipath -ll
    Example of Multi-Path verification result

Check Service Network after new deployment of Multi-node GPU Cluster

Use the following command to verify that the MII Status of the bond and of each Slave Interface is up.

  • Service Network check command

    ~$ cat /proc/net/bonding/bond-srv
    Example command to check Service Network

  • Service Network check result

    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
    Example of Service Network verification result

Reference
If any Slave Interface is in a down state, report the issue through Contact Us in the Support Center to receive support.
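
The MII Status check can be automated by parsing the bonding status text. The following is a minimal sketch, assuming the /proc/net/bonding/bond-srv format shown in the example result; the function names bond_mii_status and down_interfaces, and the "bond" key used for the bond itself, are hypothetical.

```python
from typing import Dict, List


def bond_mii_status(bonding_text: str) -> Dict[str, str]:
    """Map 'bond' and each slave interface name to its MII Status value."""
    status: Dict[str, str] = {}
    current = "bond"  # lines before the first "Slave Interface" describe the bond itself
    for line in bonding_text.splitlines():
        key, separator, value = line.partition(":")
        if not separator:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status":
            status[current] = value
    return status


def down_interfaces(bonding_text: str) -> List[str]:
    """Return the names whose MII Status is anything other than 'up'."""
    return [name for name, state in bond_mii_status(bonding_text).items() if state != "up"]
```

On a node, the input could be read with open("/proc/net/bonding/bond-srv").read(). An empty list from down_interfaces corresponds to the healthy state shown in the example result.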

Check time synchronization with the Time Server after new deployment of Multi-node GPU Cluster

The OS image ships with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that a line marked with ^* exists in the MS column; ^* marks the server that chrony is currently synchronized to.

  • Command to check the chrony daemon source and synchronization status

    ~$ chronyc sources -V
    Example of status check command

  • Result of checking chrony daemon status

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
    Example of checking the chrony daemon status
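
The search for the ^* line can be scripted as well. The following is a minimal sketch, assuming the chronyc sources output format shown in the example; the function name synced_source is hypothetical.

```python
from typing import Optional


def synced_source(chronyc_sources_output: str) -> Optional[str]:
    """Return the name or IP address of the source marked ^*, or None if there is none."""
    for line in chronyc_sources_output.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("^*"):
            fields = stripped.split()
            if len(fields) >= 2:
                # Column 1 is the "MS" marker, column 2 is the name or IP address.
                return fields[1]
    return None
```

A return value of None means that no source is currently selected for synchronization, which corresponds to the ^* line being absent.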
