How-to guides

To create the Multi-node GPU Cluster service, enter the required information and select detailed options in the Samsung Cloud Platform Console.

Multi-node GPU Cluster Getting Started

You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.

This service consists of a GPU Node and a Cluster Fabric service.

Create GPU Node

To create a Multi-node GPU Cluster, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the Create GPU Node button. You will be taken to the Create GPU Node page.
  3. On the Create GPU Node page, enter the information required to create the service and select detailed options.
    • Select the required information in the Image and Version Selection area.
      Category | Required | Detailed description
      Image | Required | Select the type of image provided
      • Ubuntu
      Image version | Required | Select the version of the chosen image
      • A list of versions is provided for the selected server image
      Table. GPU Node image and version selection options
    • In the Service Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Number of servers | Required | Number of GPU Node servers to create simultaneously
      • Only numeric input is allowed, and the minimum number of servers to create is 2.
      • Two or more servers can be created at once only during the initial setup; later expansions add one server at a time.
      Service Type > Server Type | Required | GPU Node server type
      • Select the desired CPU, Memory, GPU, and Disk specifications
      Service Type > Planned Compute | Required | Status of resources with Planned Compute configured
      • In Use: Number of resources with Planned Compute that are currently in use
      • Configured: Number of resources with Planned Compute configured
      • Coverage Preview: Amount applied per resource by Planned Compute
      • Apply for Planned Compute Service: Navigate to the Planned Compute service application page
      Table. GPU Node Service Information Input Items
    • In the Required Information Input area, enter or select the necessary information.
      Category | Required | Detailed description
      Administrator account | Required | Set the administrator account and password used to connect to the server
      • For Ubuntu OS, the account is fixed to root
      Server name Prefix | Required | Enter a prefix to distinguish each GPU Node created when the selected number of servers is 2 or more
      • Names are generated automatically in the form of the user input value (prefix) + '-###'
      • Must start with a lowercase English letter and use only lowercase letters, numbers, and the special character (-), within 3 to 11 characters
      • Must not end with the special character (-)
      Network Settings | Required | Set the network where the GPU Node will be installed
      • VPC name: select a pre-created VPC
      • General Subnet name: select a pre-created general Subnet
        • The IP can be auto-generated or entered manually; for manual input, enter the IP directly
      • NAT: available only when there is a single server and the VPC is attached to an Internet Gateway. Check the box to use NAT and select a NAT IP. Because the initial creation requires two or more servers, NAT cannot be set at creation time; configure it afterward on the resource details page.
      • NAT IP: select a NAT IP
        • If no NAT IP is available, click the Create new button to generate a Public IP
        • Click the Refresh button to view and select the created Public IP
        • Creating a Public IP incurs charges according to the Public IP pricing
      Table. Required input fields for GPU Node
    • In the Cluster Selection area, create or select a Cluster Fabric.
      Category | Required | Detailed description
      Cluster Fabric | Required | Configure a group of GPU Node servers to which GPU Direct RDMA can be applied together
      • Optimal GPU performance and speed can be achieved only within the same Cluster Fabric
      • To create a new Cluster Fabric, choose New Input, select a Node pool, then enter the name of the Cluster Fabric to create
      • To add to an existing Cluster Fabric, choose Existing Input, select a Node pool, then select the previously created Cluster Fabric
      Table. GPU Node Cluster Fabric selection items
    • In the Additional Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Lock | Optional | Using Lock prevents accidental actions that could terminate, start, or stop the server.
      Init Script | Optional | Script to run when the server starts
      • The Init Script must be selected according to the image type
        • For Linux: choose Shell Script or cloud-init
      Tag | Optional | Add tags
      • Up to 50 tags can be added per resource
      • After clicking the Add Tag button, enter or select the Key and Value
      Table. GPU Node additional information input fields
  4. Check the detailed information and estimated billing amount in the Summary panel, and click the Create button.
    • When creation is complete, check the created resources on the GPU Node List page.
Caution
  • The GPU MIG/ECC settings are reset when the service is created. To make sure the correct settings are applied, perform an initial reboot and verify that the settings have taken effect before proceeding.
  • For detailed information on resetting GPU MIG/ECC settings, refer to the GPU MIG/ECC Reset Settings Checklist Guide.
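
The Server name Prefix rules in the table above can be checked before submitting the form. The following is a minimal sketch, not part of the console: the function names are hypothetical, and it assumes that '###' stands for a zero-padded three-digit index starting at 001, which the guide does not state explicitly.

```python
import re

# Rules from the Server name Prefix row: starts with a lowercase letter,
# uses only lowercase letters, digits, and hyphens, is 3 to 11 characters
# long, and does not end with a hyphen.
PREFIX_PATTERN = re.compile(r"[a-z][a-z0-9-]{1,9}[a-z0-9]")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the documented naming rules."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


def preview_server_names(prefix: str, count: int) -> list:
    """Preview generated names.

    ASSUMPTION: '###' is a zero-padded index starting at 001.
    """
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    if count < 2:
        # The initial creation requires at least 2 servers.
        raise ValueError("the initial creation requires at least 2 servers")
    return [f"{prefix}-{i:03d}" for i in range(1, count + 1)]
```

For example, `is_valid_prefix("gpu-")` is rejected because the prefix ends with a hyphen, while `preview_server_names("gpunode", 2)` lists two names under the stated numbering assumption.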

Check GPU Node details

The Multi-node GPU Cluster service allows you to view and edit the full resource list and detailed information of GPU nodes.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Node, follow these steps.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.

  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.

    • Resource items beyond the required columns can be added via the Settings button.
      Category | Required | Detailed description
      Resource ID | Optional | User-created GPU Node ID
      Cluster Fabric name | Required | User-created Cluster Fabric name
      Server name | Required | User-created GPU Node name
      Server type | Required | Server type of the GPU Node
      • Users can view the number of cores, memory capacity, and GPU type and quantity of the resources they created
      Image | Required | User-created GPU Node image version
      IP | Required | IP of the user-created GPU Node
      Status | Required | Status of the user-created GPU Node
      Creation date and time | Optional | GPU Node creation timestamp
      Table. GPU Node resource list items
  3. On the GPU Node List page, click the resource for which you want to view detailed information. You will be taken to the GPU Node Details page.

    • At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.
      Category | Detailed description
      GPU Node status | Status of the user-created GPU Node
      • Creating: server is being created
      • Running: creation completed and available
      • Editing: IP is being changed
      • Unknown: error state
      • Starting: server is starting
      • Stopping: server is stopping
      • Stopped: server stopped
      • Terminating: terminating
      • Terminated: termination completed
      Server control | Buttons to change the server status
      • Start: Start a stopped server
      • Stop: Stop a running server
      Service termination | Button to terminate the service
      Table. GPU Node status information and additional features
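
The relationship between the status values and the Server control buttons can be sketched as a small lookup. This is an illustrative model only: the table states that Start applies to a stopped server and Stop to a running server, and the sketch assumes that no control is offered in any other state.

```python
from enum import Enum


class NodeStatus(Enum):
    """GPU Node status values listed in the table above."""
    CREATING = "Creating"
    RUNNING = "Running"
    EDITING = "Editing"
    UNKNOWN = "Unknown"
    STARTING = "Starting"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


# Per the table: Start applies to a stopped server, Stop to a running server.
# ASSUMPTION: every other state offers no server control.
ALLOWED_CONTROLS = {
    NodeStatus.STOPPED: {"Start"},
    NodeStatus.RUNNING: {"Stop"},
}


def allowed_controls(status: str) -> set:
    """Return the server control actions available for a status string."""
    return set(ALLOWED_CONTROLS.get(NodeStatus(status), set()))
```

A status string that is not in the table raises `ValueError` from the `NodeStatus` lookup, which makes typos visible early.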

Details

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit it if necessary.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In a GPU Node, it refers to the GPU Node SRN
Resource name | Resource name
  • In the GPU Node service, it refers to the GPU Node name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Editor | User who edited the service information
Modification date and time | Date and time the service information was modified
Server name | Server name
Node pool | A collection of nodes that can be grouped into the same Cluster Fabric
Cluster Fabric name | User-created Cluster Fabric name
Image/Version | Server OS image and version
Server type | Displays CPU, memory, and GPU information
Planned Compute | Status of resources with Planned Compute configured
Lock | Displays whether Lock is enabled or disabled
  • When Lock is enabled, server termination, start, and stop operations are blocked to prevent accidental actions.
  • To change the Lock setting, click the Edit button.
Network | GPU Node network information
  • VPC name, general Subnet name, IP, IP status, NAT IP, NAT IP status
Block Storage | Block Storage information attached to the server
  • Volume name, disk type, capacity, status
Init Script | View the Init Script content entered when creating the server
Table. GPU Node detailed information tab items

Tags

On the Tags tab of the GPU Node Details page, you can view the selected resource's tag information, and add, modify, or delete tags.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of each tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. GPU Node Tag Tab Items

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category | Detailed description
Job history list | Resource change history
  • Check operation details, operation timestamp, resource type, resource name, event topic, operation result, and operator information
  • The Detailed Search button provides detailed search functionality
Table. GPU Node Job History Tab Detailed Information Items

Control GPU Node Operation

If the created GPU Node resources require server control and management, you can perform these tasks on the GPU Node List or GPU Node Details page. You can start a stopped GPU Node and stop a running one.

Start GPU Node

You can start a GPU Node that is in the Stopped state. To start a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
    • On the GPU Node List page, you can also select one or more servers with the checkboxes and start them with the More button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Start button at the top to start the server.
  4. Check the server status and complete the status change.

Stop GPU Node

You can stop a GPU Node that is in the Running state. To stop a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
    • On the GPU Node List page, you can also select one or more servers with the checkboxes and stop them with the Stop button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Stop button at the top to stop the server.
  4. Check the server status and complete the status change.

Terminate GPU Node

You can terminate unused GPU Nodes to reduce operating costs. However, terminating a service may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with termination.

Caution
Please note that data cannot be recovered after terminating the service.

To terminate a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of the Multi-node GPU Cluster.
  2. On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
  3. On the Cluster Fabric List page, select the resources to terminate, and click the Service Termination button.
    • Resources that use the same Cluster Fabric can be terminated simultaneously.
  4. When termination is complete, check on the GPU Node List page whether the resources have been terminated.
Information

The cases where a GPU Node cannot be terminated are as follows.

  • When Block Storage(BM) is connected: Please disconnect the Block Storage(BM) connection first.
  • If File Storage is connected: Please disconnect the File Storage first.
  • When Lock is set: Please change the Lock setting to disabled and try again.
  • If the selection includes a server that cannot be terminated simultaneously: Please re-select only resources that can be terminated.
  • If the server you want to terminate has a different Cluster Fabric: Select only resources that use the same Cluster Fabric.
Reference
If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.
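
The termination conditions listed above can be expressed as a pre-check. The sketch below is illustrative only: the `GpuNode` record and its field names are hypothetical and do not correspond to a console or API schema.

```python
from dataclasses import dataclass


@dataclass
class GpuNode:
    # Hypothetical record; field names are illustrative only.
    name: str
    cluster_fabric: str
    lock_enabled: bool = False
    block_storage_attached: bool = False
    file_storage_attached: bool = False


def termination_blockers(nodes: list) -> list:
    """Return the reasons why the selected nodes cannot be terminated together."""
    blockers = []
    for node in nodes:
        if node.block_storage_attached:
            blockers.append(f"{node.name}: disconnect Block Storage(BM) first")
        if node.file_storage_attached:
            blockers.append(f"{node.name}: disconnect File Storage first")
        if node.lock_enabled:
            blockers.append(f"{node.name}: disable Lock first")
    # Only resources that use the same Cluster Fabric can be terminated together.
    if len({node.cluster_fabric for node in nodes}) > 1:
        blockers.append("selection spans more than one Cluster Fabric")
    return blockers
```

An empty result means none of the listed conditions applies to the selection.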

1 - Manage Cluster Fabric

Cluster Fabric is a service that helps manage the servers (GPU Node) included in a GPU Cluster. By using Cluster Fabric, you can move servers between GPU Clusters in the same Node pool and optimize GPU performance and speed within the same GPU Cluster.

Creating Cluster Fabric

A Cluster Fabric is created together with a GPU Node; it cannot be created or deleted separately. If all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
If you have not created a GPU Node, please create a GPU Node first. For more information, see Create GPU Node.

Check Cluster Fabric details

Notice
  • A Cluster Fabric is created together with a GPU Node; it cannot be created or deleted independently.
  • If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
  • If you have not created a GPU Node, please create a GPU Node first. For more details, refer to Create GPU Node.

On the Cluster Fabric List page and the Cluster Fabric Details page, you can view the list and details of the created Cluster Fabrics and move servers between them.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of the Multi-node GPU Cluster.

  2. On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.

    • On the Cluster Fabric List page, you can view the resource list of GPU clusters created by the user.
    • Resource items beyond the required columns can be added via the Settings button.
      Category | Required | Detailed description
      Resource ID | Optional | User-created Cluster Fabric ID
      Cluster Fabric name | Required | User-created Cluster Fabric name
      Node pool | Optional | A collection of nodes that can be grouped into the same Cluster Fabric
      Number of servers | Optional | Number of GPU Nodes
      Server type | Optional | Server type of the GPU Node
      • Users can view the number of cores, memory capacity, and GPU type and count of the resources they created
      Status | Optional | Status of the user-created Cluster Fabric
      Creation date and time | Optional | Cluster Fabric creation timestamp
      Table. Cluster Fabric resource list items
  3. On the Cluster Fabric List page, click the resource to view detailed information. You will be taken to the Cluster Fabric Details page.

    • At the top of the Cluster Fabric Details page, status information and descriptions of additional features are displayed.
      Category | Detailed description
      Cluster Fabric status | Status of the user-created Cluster Fabric
      • Creating: State while the cluster is being created
      • Active: State when creation is complete and the cluster is usable
      • Editing: State while the IP is being changed
      • Deleting: State while being terminated
      • Deleted: State after termination is complete
      Add target server | A feature that moves a server from another cluster to the target cluster.
      Table. Cluster Fabric status information and additional features

Details

On the Details tab of the Cluster Fabric Details page, you can view detailed information of the selected resource and import servers from another cluster.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In Cluster Fabric, it refers to the Cluster Fabric SRN
Resource name | Resource name
  • In the Cluster Fabric service, it refers to the Cluster Fabric name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Editor | User who edited the service information
Modification date and time | Date and time the service information was modified
Cluster Fabric name | User-created Cluster Fabric name
Node pool | A collection of nodes that can be grouped into the same Cluster Fabric
Target server | List of GPU Nodes bound to the Cluster Fabric
  • Server name, server type, IP, status
Table. Cluster Fabric detailed information tab items

Import Cluster Fabric Server

The Add target server feature on the Cluster Fabric Details page lets you import servers from another cluster and add them to the selected cluster.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of the Multi-node GPU Cluster.
  2. On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
  3. On the Cluster Fabric List page, click the resource to view detailed information. You will be taken to the Cluster Fabric Details page.
  4. In the Target server item of the Details tab, click the Add button on the right.
    • The Add target server popup window opens.
      • Select a cluster from Cluster Fabric.
      • The GPU Nodes associated with the selected cluster are listed; select the GPU Node you want to import.
      • The name of the selected GPU Node is displayed at the bottom.
      • Click the Confirm button to complete the operation.
      • Click the Cancel button to cancel the operation.
    • Verify that the added GPU Node is displayed in the Target server item.
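
The import operation can be pictured as moving a node name from one cluster's server list to another. The sketch below is a hypothetical in-memory model, not a console or API schema; it encodes only the constraint stated earlier in this guide that servers move between clusters in the same Node pool.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterFabric:
    # Hypothetical in-memory model; names and fields are illustrative only.
    name: str
    node_pool: str
    nodes: list = field(default_factory=list)


def import_server(target: ClusterFabric, source: ClusterFabric, node_name: str) -> None:
    """Move one GPU Node from source to target if both share a Node pool."""
    if source.node_pool != target.node_pool:
        raise ValueError("servers can only move between clusters in the same Node pool")
    if node_name not in source.nodes:
        raise ValueError(f"{node_name} is not bound to {source.name}")
    source.nodes.remove(node_name)
    target.nodes.append(node_name)
```

After a successful call, the node appears in the target cluster's list, which mirrors the final verification step above.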

Terminate Cluster Fabric

If all GPU Nodes in the Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more information, see Terminate GPU Node.

2 - Install ServiceWatch Agent

Users can install the ServiceWatch Agent on GPU nodes of a Multi-node GPU Cluster to collect custom metrics and logs.

Reference
Collecting custom metrics/logs via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will also be available in other offerings in the future.
Caution
Metrics collected through the ServiceWatch Agent are classified as custom metrics and, unlike the metrics collected by default, incur charges. It is recommended to remove or disable unnecessary metric collection settings.

ServiceWatch Agent

In a Multi-node GPU Cluster, the agents installed on GPU nodes to collect ServiceWatch custom metrics and logs fall into two main types: the Prometheus Exporter and the OpenTelemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape
  • To collect OS metrics on a GPU Node, you can use the Node Exporter for Linux servers or the Windows Exporter for Windows servers, depending on the OS type.
OpenTelemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Explanation of Prometheus Exporter and OpenTelemetry Collector
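
To make "a format that Prometheus can scrape" concrete, the sketch below parses a few lines of the Prometheus text exposition format. It is a simplified illustration: the sample values are made up, the parser assumes no trailing timestamps, and it keeps any label set as part of the series key instead of parsing it.

```python
# Made-up sample in the Prometheus text exposition format.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""


def parse_metrics(text: str) -> dict:
    """Map each series (name plus optional labels) to its float value.

    ASSUMPTION: lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples
```

In a real deployment the exporter serves this text over HTTP and the collector scrapes it; the parsing here only illustrates the shape of the data.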
Information

If you have configured a Kubernetes Engine on a GPU node, please view the GPU metrics using the metrics provided by the Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, it may not operate correctly.
Reference
The ServiceWatch Agent guide for collecting GPU metrics applies to a GPU Node in the same way as to a GPU Server. For more details, refer to GPU Server > ServiceWatch Agent.

Pre-configuration for Using ServiceWatch Agent

To use the ServiceWatch Agent, complete the prerequisite settings described in Prerequisite Settings for ServiceWatch Agent.

3 - Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster Service Scope

If an IaaS hardware-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installations are the user's responsibility, and technical support for them is limited. Take care when performing tasks such as system updates.

IaaS hardware-level issues

  • HW fault event messages generated within the server by the IPMI hardware monitoring console
  • GPU HW operation errors observed in the output of the nvidia-smi command
  • HW error messages that occur during inspection of InfiniBand HCA cards or InfiniBand switches
Caution
Because the Multi-node GPU Cluster is sensitive to software version compatibility among Ubuntu OS, NVIDIA, and InfiniBand components, official technical support is unavailable after changes such as a user's OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.

Intel E810 driver update

Check the version of the Intel E810 driver and, if necessary, refer to the following procedure to perform an update.

Caution
Proceed with the update only if the Multi-node GPU Cluster node has an Intel E810 device and also uses a standard image of version 535.86.10 or lower.
Reference

You can use the lspci command to check whether an E810 NIC device is present. If an E810 NIC is present, the PCIe device is listed with E810-C information as shown below. If there is no E810 NIC device, skip this procedure.

lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
Example of checking for the presence of an E810 NIC device
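
The two conditions in the caution above (an E810 device is present, and the standard image version is 535.86.10 or lower) can be combined into a single check. The following is a minimal sketch with hypothetical function names; it assumes the image version is a dotted numeric string and that the lspci output is passed in as text.

```python
# Sample lspci output taken from the example above.
SAMPLE_LSPCI = """\
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
"""

# Threshold stated in the caution above.
MAX_IMAGE_VERSION = "535.86.10"


def find_e810_devices(lspci_output: str) -> list:
    """Return the PCI addresses of lines that mention an E810 controller."""
    return [line.split()[0] for line in lspci_output.splitlines() if "E810" in line]


def version_tuple(version: str) -> tuple:
    """Convert a dotted numeric version such as '535.86.10' to a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def update_recommended(lspci_output: str, image_version: str) -> bool:
    """True only if an E810 device exists and the image version is at or below the threshold."""
    has_e810 = bool(find_e810_devices(lspci_output))
    return has_e810 and version_tuple(image_version) <= version_tuple(MAX_IMAGE_VERSION)
```

Comparing integer tuples avoids the pitfall of string comparison, where "535.183.06" would sort before "535.86.10".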

To update the driver, follow the steps below.

  1. Move the base driver tar file to the desired directory, such as the one below.

    /usr/local/src
    Directory navigation example

  2. Extract the archive file.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      Archive extraction example
  3. Change to the src directory of the driver.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Example of changing to the src directory
  4. Compile and install the driver module.

    make install
    Driver module compilation example

  5. After the update is complete, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Version check example
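
The output of `modinfo ice | grep version` typically contains more than one matching line, because the `srcversion` field also contains the word "version". The sketch below shows one way to pick out only the driver version; the sample text in the usage note is made up for illustration and is not output from a real node.

```python
from typing import Optional


def parse_ice_version(modinfo_output: str) -> Optional[str]:
    """Return the value of the 'version' field from modinfo output, if present."""
    for line in modinfo_output.splitlines():
        key, _, value = line.partition(":")
        # Match the exact field name so that 'srcversion' is ignored.
        if key.strip() == "version":
            return value.strip()
    return None
```

For a made-up input such as `"version:        1.2.3\nsrcversion:     0123456789ABCDEF"`, the function returns only the `version` value.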

Check NVIDIA driver

To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS hardware level, follow these steps.

Reference
The MIG example in this procedure uses an A100 GPU node as the reference.
  1. Check the GPU driver status.

    ~$ nvidia-smi
    Example code for checking GPU driver status
    Thu Jan 29 14:48:31 2026
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
    | N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
    | N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
    | N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
    | N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
    | N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
    | N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  No MIG devices found                                                                 |
    +---------------------------------------------------------------------------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
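
For scripted checks, the same MIG and ECC state can be read in CSV form instead of the table layout. The sketch below assumes the `nvidia-smi --query-gpu` fields `index`, `name`, `mig.mode.current`, and `ecc.mode.current` are available on the installed driver; the sample text used in the usage note is illustrative and was not captured from a real node.

```python
import csv
import io

# Command that would produce the CSV parsed below (not executed here).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,mig.mode.current,ecc.mode.current",
    "--format=csv,noheader",
]

FIELDS = ["index", "name", "mig_mode", "ecc_mode"]


def parse_gpu_modes(csv_text: str) -> list:
    """Parse CSV rows into dictionaries keyed by FIELDS."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in reader if row]


def mig_enabled_gpus(csv_text: str) -> list:
    """Return the indexes of GPUs whose current MIG mode is Enabled."""
    return [gpu["index"] for gpu in parse_gpu_modes(csv_text) if gpu["mig_mode"] == "Enabled"]
```

On a node, the text could be obtained with `subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout` and passed to `mig_enabled_gpus` after the initial reboot to confirm that the intended settings took effect.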
    GPU driver status example

  2. Check the NVSwitch and NVLink hardware status.

    • Check NVSwitch status

      ~$ nvidia-smi nvlink --status
      NVSwitch status check example
      GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      NVSwitch status example

    • Check NVLink hardware status

      ~$ nvidia-smi topo -m
      Example code for checking NVLink hardware status
              GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
      GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
      GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
      GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
      GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A
      
      Legend:
      
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks
      NVLink HW status check code example

  3. Check the InfiniBand (IB) HCA card hardware status and link.

    user@bm-dev-001:~$ ibdev2netdev -v
    HW status check command example
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    Example of HW status check result
    root@bm-dev-001:~# ibstat
    Example of link verification command
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Link verification result example
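
The NVLink and InfiniBand checks in steps 2 and 3 can also be scripted instead of read by eye. The following is a minimal sketch under stated assumptions: the function names are hypothetical and not part of any NVIDIA or InfiniBand tool, and the parsing assumes the output formats shown in the examples above, which may differ between driver versions.

```shell
# Hypothetical helper functions; each one reads command output on stdin.

# Count NVLink entries that do not report a bandwidth such as "25 GB/s".
# Input: output of `nvidia-smi nvlink --status`.
nvlink_links_without_bandwidth() {
  awk '/Link [0-9]+:/ && !/GB\/s/ { n++ } END { print n + 0 }'
}

# Print "<row GPU> <column GPU> <type>" for GPU pairs not joined by NVLink.
# Input: output of `nvidia-smi topo -m`.
topo_pairs_without_nvlink() {
  awk '
    /CPU Affinity/ {                      # header row: remember the GPU columns
      for (i = 1; i <= NF; i++)
        if ($i ~ /^GPU[0-9]+$/) col[++ngpu] = $i
      next
    }
    ngpu > 0 && $1 ~ /^GPU[0-9]+$/ {      # one matrix row per GPU
      for (i = 1; i <= ngpu; i++) {
        v = $(i + 1)
        if (v != "X" && v !~ /^NV[0-9]+$/) print $1, col[i], v
      }
    }
  '
}

# Print "<CA> <state> <physical state>" for ports that are not Active/LinkUp.
# Input: output of `ibstat`.
ib_ports_not_active() {
  awk '
    $1 == "CA" && NF == 2 { ca = $2; gsub("\047", "", ca); next }
    $1 == "State:" { state = $2 }
    $1 == "Physical" && $2 == "state:" {
      if (state != "Active" || $3 != "LinkUp") print ca, state, $3
    }
  '
}
```

For example, `nvidia-smi nvlink --status | nvlink_links_without_bandwidth` prints 0 when every listed link reports a bandwidth, and `ibstat | ib_ports_not_active` prints nothing when all ports are Active and LinkUp, as in the examples above.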

Check IB bandwidth communication

To inspect the IaaS hardware level, check the IB bandwidth communication status with ib_send_bw by following the steps below.

  1. Check the name of the IB HCA interface.

    • In the following example, the ACTIVE InfiniBand (ConnectX-7) ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8.
      ~$ ibdev2netdev -v
      Example of checking IB HCA interface name
      0000:1a:00.0 mlx5_0 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
      0000:1b:00.0 mlx5_1 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:3c:00.0 mlx5_2 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp60s0 (Down)
      0000:4d:00.0 mlx5_3 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp77s0 (Down)
      0000:5e:00.0 mlx5_4 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
      0000:9c:00.0 mlx5_5 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
      0000:9d:00.0 mlx5_6 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:bc:00.0 mlx5_7 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp188s0 (Down)
      0000:cc:00.0 mlx5_8 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
      0000:dc:00.0 mlx5_9 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp220s0 (Down)
      Example of IB HCA interface name verification result
  2. Run the following command on the SERVER Side to check the communication status.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F
    SERVER Side command example
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex.  method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
     remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             0.00               19827.40                   0.317238
    ---------------------------------------------------------------------------------------
    Example of mutual communication status check result

  3. Run the following command on the CLIENT Side to check the communication status.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
    CLIENT Side command example
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     RX depth        : 512
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex.  method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
     remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             19008.49            19006.37                  0.304102
    ---------------------------------------------------------------------------------------
    Example of mutual communication status check result
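
The steps above can be sketched as a small script that picks the ACTIVE InfiniBand devices and reads the average bandwidth from the test result. This is only an illustration: the function names are hypothetical, the device filter assumes the adapter description is visible in the `ibdev2netdev -v` output as in the example above, and the minimum bandwidth passed to `bw_at_least` is a value you choose yourself, not a figure stated in this guide.

```shell
# Hypothetical helper functions; each one reads command output on stdin.

# List devices whose description contains "Infiniband" and whose port is ACTIVE.
# Input: output of `ibdev2netdev -v`.
active_ib_devices() {
  awk '/Infiniband/ && /\(ACTIVE/ { print $2 }'
}

# Print the "BW average[MB/sec]" value from the result row.
# Input: output of `ib_send_bw`.
send_bw_average() {
  awk '/^[ \t]*#bytes/ { if ((getline) > 0) print $4; exit }'
}

# Succeed when the measured value is at least the chosen minimum (both in MB/sec).
# Usage: bw_at_least <measured> <minimum>
bw_at_least() {
  awk -v m="$1" -v min="$2" 'BEGIN { exit !(m + 0 >= min + 0) }'
}
```

For the example output in step 1, `ibdev2netdev -v | active_ib_devices` lists mlx5_0, mlx5_4, mlx5_5, and mlx5_8. On the CLIENT Side, the result of `ib_send_bw` can be piped into `send_bw_average` and the value then compared with `bw_at_least`.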

Check IB service-related kernel modules

To inspect the IaaS hardware level, check the kernel modules related to the IB service with lsmod.

  • Check IB service-related kernel module - nvidia_peermem

    ~$ lsmod | grep nvidia_peermem
    Example command to check nvidia_peermem
    nvidia_peermem         16384  0
    ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
    Example of nvidia_peermem check result

  • Check IB service-related kernel modules - IB Card (HCA) driver

    ~$ lsmod|egrep 'ib_|_ib|mlx'
    Example command to check the IB Card (HCA) driver
    mlx5_ib               393216  0
    ib_uverbs             163840  2 irdma,mlx5_ib
    ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
    mlx5_core            1593344  1 mlx5_ib
    mlxfw                  32768  1 mlx5_core
    psample                20480  1 mlx5_core
    tls                   114688  1 mlx5_core
    pci_hyperv_intf        16384  1 mlx5_core
    ib_ipoib              139264  0
    ib_cm                 131072  2 rdma_cm,ib_ipoib
    ib_umad                40960  0
    mlx5_ib               454656  0
    ib_uverbs             135168  2 rdma_ucm,mlx5_ib
    ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
    mlx5_core            2064384  1 mlx5_ib
    mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
    Example of IB Card (HCA) driver verification result
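
Both module checks above can be combined into one pass over the `lsmod` output. The sketch below is illustrative only: `missing_modules` is a hypothetical function name, and the list of module names to require is an assumption that you should adapt to your own image.

```shell
# Hypothetical helper: print each named module that is absent from the
# `lsmod` output read on stdin.
# Usage: lsmod | missing_modules <module> [<module> ...]
missing_modules() {
  awk -v want="$*" '
    BEGIN { n = split(want, w, " ") }     # module names passed as arguments
    { loaded[$1] = 1 }                    # first column of lsmod is the module name
    END {
      for (i = 1; i <= n; i++)
        if (!(w[i] in loaded)) print w[i]
    }
  '
}
```

For example, `lsmod | missing_modules nvidia_peermem ib_core mlx5_core mlx5_ib` prints nothing when all four modules are loaded.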

Check storage physical disk resources and Multi-Path

To inspect the IaaS hardware level, check the storage physical disk resources and the Multi-Path configuration.

  • Check storage physical disk resources

    root@bm-dev-002:/tmp# fdisk -l
    Example command to check storage physical disk resources

  • Check Multi-Path

    root@bm-dev-002:/tmp# multipath -ll
    Example command to check Multi-Path

Check Service Network after new deployment of Multi-node GPU Cluster

Use the following command to verify that the MII Status of the bond and of each Slave Interface is up.

  • Service Network check command

    ~$ cat /proc/net/bonding/bond-srv
    Example command to check Service Network

  • Service Network check result

    Example of Service Network verification result
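
The same check can be scripted instead of read by eye. The following is a minimal sketch, assuming the bonding status file has the layout shown in the result above; `check_bond_mii` is a hypothetical helper name, not a command provided by the service. It reports the bond itself or any Slave Interface whose MII Status is not up.

```shell
# Hypothetical helper: scan a bonding status file and report every
# "MII Status" line that is not "up". Returns non-zero if any is found.
# Usage: check_bond_mii [FILE]   (FILE defaults to /proc/net/bonding/bond-srv)
check_bond_mii() {
    file="${1:-/proc/net/bonding/bond-srv}"
    awk '
        /^Slave Interface:/ { iface = $3 }   # remember which slave we are in
        /^MII Status:/ {
            # Before the first "Slave Interface:" line, the status is the bond itself
            name = (iface == "") ? "bond" : iface
            if ($3 != "up") { printf "%s: MII Status %s\n", name, $3; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    ' "$file"
}
```

Calling `check_bond_mii` with no argument reads `/proc/net/bonding/bond-srv`; it prints nothing and returns success when every interface is up.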

Reference
If any Slave Interface is in a down state, contact the Support Center to report the issue and receive a response.

Check time synchronization with the Time Server after new deployment of Multi-node GPU Cluster

The OS image comes with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that a line marked with ^* appears in the MS column; ^* indicates the server that chrony is currently synchronized to.

  • Command to check the chrony daemon source and synchronization status

    ~$ chronyc sources -V
    Example of status check command

  • Result of checking chrony daemon status

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
    Example of checking the chrony daemon status
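
The presence of a ^* line can be tested in a script as well. The following is a minimal sketch, assuming output in the format shown above; `has_synced_source` is a hypothetical helper name. It reads the source listing on standard input and succeeds only if some line begins with ^*.

```shell
# Hypothetical helper: succeed if the "chronyc sources" output on stdin
# contains a line starting with "^*" (leading whitespace is tolerated).
has_synced_source() {
    grep -Eq '^[[:space:]]*\^\*'
}
```

For example, `chronyc sources | has_synced_source && echo "time is synchronized"` prints the message only when a synchronized source is listed.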