How-to guides

In the Samsung Cloud Platform Console, you can enter the required information for the Multi-node GPU Cluster service, select detailed options, and create the service.

Multi-node GPU Cluster Getting Started

You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.

This service consists of GPU Node and Cluster Fabric services.

GPU Node Creation

To create a Multi-node GPU Cluster, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
  2. Click the GPU Node creation button on the Service Home page. You will be taken to the GPU Node creation page.
  3. On the GPU Node creation page, enter the information required to create the service and select detailed options.
    • In the Image and Version Selection area, select the required information.
      Category | Required | Detailed description
      Image | Required | Select the provided image type
      • Ubuntu
      Image Version | Required | Select the version of the chosen image
      • Provides a list of versions of the provided server images
      Table. GPU Node image and version selection items
  • In the Enter Service Information area, input or select the required information.
    Category | Required | Detailed description
    Number of Servers | Required | Number of GPU Node servers to create simultaneously
    • Only numbers can be entered; the minimum number of servers to create is 2.
    • Two or more servers can be created only during the initial setup; afterward, expansion is possible one server at a time.
    Service Type > Server Type | Required | GPU Node server type
    • Select the desired CPU, memory, GPU, and disk specifications
    Service Type > Planned Compute | Required | Status of resources with Planned Compute set
    • In Use: number of resources with Planned Compute currently in use
    • Configured: number of resources with Planned Compute set
    • Coverage Preview: amount applied per resource by Planned Compute
    • Planned Compute Service Application: go to the Planned Compute service application page
    Table. GPU Node service information input items
    • In the Required Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Administrator Account | Required | Set the administrator account and password to use when connecting to the server
      • For Ubuntu OS, the account is fixed to root
      Server Name Prefix | Required | Enter a prefix to distinguish each GPU Node created when the number of selected servers is 2 or more
      • Names are generated automatically in the format: user input value (prefix) + ‘-###’
      • Must start with a lowercase English letter and may use lowercase letters, numbers, and the special character (-), within 3 to 11 characters
      • Must not end with the special character (-)
      Network Settings | Required | Set the network where the GPU Node will be installed
      • VPC Name: select a pre-created VPC
      • General Subnet Name: select a pre-created general Subnet
        • The IP can be auto-generated or entered by the user; if user input is selected, enter the IP directly
      • NAT: available only when there is 1 server and the VPC has an Internet Gateway attached. Checking ‘Use’ allows selection of a NAT IP. (At initial creation the service is created only with 2 or more servers, so set NAT on the resource detail page afterward.)
      • NAT IP: select a NAT IP
        • If there is no NAT IP to select, click the Create New button to create a Public IP
        • Click the Refresh button to view and select the created Public IP
        • Creating a Public IP incurs charges according to the Public IP pricing policy
      Table. GPU Node required information entry items
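The Server Name Prefix rules above can be checked locally before filling in the form. A minimal sketch (not part of the console, just an illustration of the stated rules) using an extended regular expression:

```shell
# Hypothetical local check of the Server Name Prefix rules: 3 to 11 characters,
# starts with a lowercase letter, uses only lowercase letters, digits, and '-',
# and does not end with '-'.
valid_prefix() {
  printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9-]{1,9}[a-z0-9]$'
}

valid_prefix "gpu-node"  && echo "gpu-node: ok"
valid_prefix "gpu-node-" || echo "gpu-node-: rejected"  # ends with '-'
valid_prefix "1gpu"      || echo "1gpu: rejected"       # starts with a digit
valid_prefix "ab"        || echo "ab: rejected"         # shorter than 3 characters
```

The actual server names are then generated as prefix + ‘-###’, e.g. gpu-node-001.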
  • In the Cluster Selection area, create or select a Cluster Fabric.
    Category | Required | Detailed description
    Cluster Fabric | Required | Set a group of GPU Node servers to which GPU Direct RDMA can be applied together
    • Optimal GPU performance and speed are secured only within the same Cluster Fabric
    • To create a new Cluster Fabric, select New Input > Node pool, then enter the name of the Cluster Fabric to create
    • To add to an existing Cluster Fabric, select Existing Input > Node pool, then select the already created Cluster Fabric
    Table. GPU Node Cluster Fabric selection items
    • In the Additional Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Lock | Optional | Using Lock prevents accidental actions that could terminate/start/stop the server
      Init Script | Optional | Script to run when the server starts
      • The Init Script type must be selected according to the image type
        • For Linux: select Shell Script or cloud-init
      Tag | Optional | Add tags
      • Up to 50 can be added per resource
      • After clicking the Add Tag button, enter or select Key and Value
      Table. GPU Node additional information input items
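For the Shell Script Init Script type on Linux images, a minimal illustrative sketch might look like the following; the marker file path and the commented package install are assumptions for illustration, not platform defaults.

```shell
#!/bin/bash
# Hypothetical Init Script (Shell Script type for Linux images).
# Runs when the server starts; here it records the provisioning time to a
# marker file so a later login can confirm that the script ran.
MARKER=/tmp/provision-marker
date -u +%FT%TZ > "$MARKER"
echo "init script ran at $(cat "$MARKER")"

# Further setup (package installs, mounts, etc.) would go here, for example:
# apt-get update && apt-get install -y htop
```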
  4. In the Summary panel, check the detailed information and estimated billing amount, then click the Complete button.
    • Once creation is complete, check the created resources on the GPU Node List page.
Caution
  • When the service is created, the GPU MIG/ECC settings are reset. To apply the correct settings, perform a one-time reboot after creation, verify that the settings have been applied, and then use the server.
  • For detailed information about GPU MIG/ECC settings reset, please refer to GPU MIG/ECC Settings Reset Checklist Guide.

Checking GPU Node Details

The Multi-node GPU Cluster service allows you to view and edit the full list of GPU Node resources and detailed information.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster > GPU Node menu. You will be taken to the Multi-node GPU Cluster Service Home page.

  2. Click the GPU Node menu on the Service Home page. Navigate to the GPU Node List page.

    • Resource items other than the required columns can be added via the Settings button.
      Category | Required | Detailed description
      Resource ID | Optional | ID of the GPU Node created by the user
      Cluster Fabric Name | Required | Cluster Fabric name created by the user
      Server Name | Required | Name of the GPU Node created by the user
      Server Type | Required | Server type of the GPU Node
      • The user can check the number of cores, memory capacity, and GPU type and count of the created resources
      Image | Required | Image version of the GPU Node created by the user
      IP | Required | IP of the GPU Node created by the user
      Status | Required | Status of the GPU Node created by the user
      Creation Time | Optional | GPU Node creation time
      Table. GPU Node resource list items
  3. On the GPU Node List page, click a resource to view its detailed information. You will be taken to the GPU Node Details page.

    • At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.
      Category | Detailed description
      GPU Node Status | Status of the GPU Node created by the user
      • Creating: the server is being created
      • Running: creation is complete and the server is usable
      • Editing: the IP is being changed
      • Unknown: error state
      • Starting: the server is starting
      • Stopping: the server is stopping
      • Stopped: the server has stopped
      • Terminating: termination is in progress
      • Terminated: termination is complete
      Server Control | Buttons to change the server status
      • Start: start a stopped server
      • Stop: stop a running server
      Service Cancellation | Button to cancel the service
      Table. GPU Node status information and additional features

Detailed Information

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and, if necessary, edit it.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • For GPU Node, this is the GPU Node SRN
Resource Name | Resource name
  • For the GPU Node service, this is the GPU Node name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation Time | Service creation time
Modifier | User who edited the service information
Modification Time | Date and time when the service information was modified
Server Name | Server name
Node Pool | A group of nodes that can be bundled into the same Cluster Fabric
Cluster Fabric Name | Cluster Fabric name created by the user
Image/Version | OS image and version of the server
Server Type | CPU, memory, and GPU information
Planned Compute | Status of resources with Planned Compute set
Lock | Lock usage status
  • If Lock is used, server termination/start/stop is prevented to avoid accidental actions
  • To change the Lock attribute value, click the Edit button
Network | GPU Node network information
  • VPC name, general Subnet name, IP, IP status, NAT IP, NAT IP status
Block Storage | Block Storage information connected to the server
  • Volume name, disk type, capacity, status
Init Script | View the Init Script content entered when creating the server
Table. GPU Node details tab items

Tags

On the Tags tab of the GPU Node Details page, you can view the tag information of the selected resource and add, modify, or delete tags.

Category | Detailed description
Tag List | List of tags
  • The Key and Value of each tag can be checked
  • Up to 50 tags can be added per resource
  • When entering a tag, search and select from the existing list of Keys and Values
Table. GPU Node Tags tab items

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category | Detailed description
Job History List | Resource change history
  • Check job details, job date and time, resource type, resource name, event topic, job result, and worker information
  • The Detailed Search button provides a detailed search function
Table. GPU Node Job History tab items

GPU Node Operation Control

If you need server control and management functions for the created GPU Node resources, you can perform these tasks on the GPU Node List or GPU Node Details page. You can start and stop GPU Node resources.

Start GPU Node

You can start a stopped GPU Node. To start the GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
  2. Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
    • On the GPU Node List page, after selecting individual or multiple servers with the checkboxes, you can start them via the More button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Start button at the top to start the server.
  4. Check the server status and confirm that the status change is complete.

Stop GPU Node

You can stop a GPU Node that is active. To stop the GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
  2. Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
    • On the GPU Node List page, you can control individual or multiple servers by selecting the checkboxes and then clicking the Stop button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Stop button at the top to stop the server.
  4. Check the server status and confirm that the status change is complete.

GPU Node Cancel

You can cancel unused GPU Nodes to reduce operating costs. However, canceling the service may immediately stop a running service, so consider the impact of service interruption carefully before proceeding with the cancellation.

Caution
Please note that data cannot be recovered after service termination.

To cancel the GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
  2. Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
  3. On the Cluster Fabric List page, select the resource to cancel and click the Cancel Service button.
    • Resources using the same Cluster Fabric can be terminated simultaneously.
  4. Once termination is complete, check on the GPU Node List page whether the resources have been terminated.
Guide

The cases in which GPU Node termination is not possible are as follows.

  • Block Storage (BM) is connected: disconnect the Block Storage (BM) first.
  • File Storage is connected: disconnect the File Storage first.
  • Lock is set: change the Lock setting to unused and try again.
  • A server that cannot be terminated simultaneously is included: re-select only the resources that can be terminated.
  • The Cluster Fabrics of the servers to terminate differ: select only resources that use the same Cluster Fabric.
Reference
If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.

1 - Cluster Fabric Management

Cluster Fabric is a service that helps manage servers (GPU Nodes) included in a GPU Cluster. Using Cluster Fabric, you can move servers between GPU Clusters in the same Node pool and optimize the performance and speed of GPUs within the same GPU Cluster.

Creating Cluster Fabric

Cluster Fabric can be created together with a GPU Node, and it cannot be created or deleted separately. When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. If you haven’t created a GPU Node, please create one first. For more information, refer to Creating a GPU Node.

Checking Cluster Fabric Details

Guide
  • Cluster Fabric can be created together with a GPU Node, and it cannot be created or deleted separately.
  • When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted.
  • If you haven’t created a GPU Node, please create one first. For more information, refer to Creating a GPU Node.

You can check the created Cluster Fabric list and details, and move servers on the Cluster Fabric List page and Cluster Fabric Details page.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.

  2. Click on the Cluster Fabric menu on the Service Home page. It will move to the Cluster Fabric List page.

    • On the Cluster Fabric List page, you can view the list of resources of the GPU Cluster created by the user.
    • Resource items other than the required columns can be added through the Settings button.
      Category | Required | Description
      Resource ID | Optional | Cluster Fabric ID created by the user
      Cluster Fabric Name | Required | Cluster Fabric name created by the user
      Node Pool | Optional | A collection of nodes that can be bundled into the same Cluster Fabric
      Number of Servers | Optional | Number of GPU Nodes
      Server Type | Optional | Server type of the GPU Node
      • The user can check the number of cores, memory capacity, and GPU type and count of the created resource
      Status | Optional | Status of the Cluster Fabric created by the user
      Creation Time | Optional | Time when the Cluster Fabric was created
      Table. Cluster Fabric resource list items
  3. On the Cluster Fabric List page, click the resource to check its details. You will be taken to the Cluster Fabric Details page.

    • At the top of the Cluster Fabric Details page, status information and additional feature descriptions are displayed.
      Category | Description
      Cluster Fabric Status | Status of the Cluster Fabric created by the user
      • Creating: cluster creation in progress
      • Active: creation completed and available
      • Editing: IP change in progress
      • Deleting: termination in progress
      • Deleted: termination completed
      Add Target Server | Function to move a server from another cluster into this cluster
      Table. Cluster Fabric status information and additional features

Details

On the Details tab of the Cluster Fabric Details page, you can check the details of the selected resource and bring in servers from other clusters.

Category | Description
Service | Service category
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • For Cluster Fabric, this is the Cluster Fabric SRN
Resource Name | Resource name
  • For the Cluster Fabric service, this is the Cluster Fabric name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation Time | Time when the service was created
Modifier | User who modified the service information
Modification Time | Time when the service information was modified
Cluster Fabric Name | Cluster Fabric name created by the user
Node Pool | A collection of nodes that can be bundled into the same Cluster Fabric
Target Server | List of GPU Nodes bound to the Cluster Fabric
  • Server name, server type, IP, status
Table. Cluster Fabric details tab items

Bringing in Cluster Fabric Servers

Using the Add Target Server feature on the Cluster Fabric Details page, you can bring in servers from other clusters and add them to the selected cluster.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
  2. Click on the Cluster Fabric menu on the Service Home page. It will move to the Cluster Fabric List page.
  3. On the Cluster Fabric List page, click the resource to check its details. You will be taken to the Cluster Fabric Details page.
  4. On the Details tab, click the Add button on the right side of Target Server.
    • The Add Target Server popup window opens.
      • Select a Cluster Fabric.
      • The GPU Nodes bound to the selected cluster are retrieved, and you can select the GPU Node to bring in.
      • The selected GPU Node is listed at the bottom by GPU Node name.
      • Click the Confirm button to complete.
      • Click the Cancel button to cancel the task.
    • Check that the added GPU Node appears in the Target Server list.

Terminating Cluster Fabric

When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more information, refer to Terminating a GPU Node.

2 - ServiceWatch Agent Install

Users can install the ServiceWatch Agent on the GPU node of a Multi-node GPU Cluster to collect custom metrics and logs.

Reference
Collecting custom metrics/logs via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will be offered in other offerings in the future.
Caution
Since metric collection via the ServiceWatch Agent is classified as custom metrics and, unlike the default collected metrics, incurs charges, it is recommended to remove or disable unnecessary metric collection settings.

ServiceWatch Agent

The agents that need to be installed on the GPU Nodes of a Multi-node GPU Cluster to collect ServiceWatch custom metrics and logs fall broadly into two types: the Prometheus Exporter and the OpenTelemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape
  • For collecting OS metrics of a GPU Node, use Node Exporter for Linux servers or Windows Exporter for Windows servers, depending on the OS type
OpenTelemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes them (filtering, sampling, etc.), and then exports them to various backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data
Table. Description of Prometheus Exporter and OpenTelemetry Collector
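As a concrete illustration, an exporter such as Node Exporter exposes metrics as plain text in the Prometheus exposition format, which the collector then scrapes. The metric names and values below are a hypothetical sample of scraped output, and the small awk helper only shows how a value can be pulled out of it:

```shell
# Extract one metric's value from Prometheus exposition-format text.
metric_value() {
  # $1: metric name; reads exposition text on stdin and prints the value of
  # every line whose first field equals the metric name ('#' comment lines
  # never match a metric name, so they are skipped)
  awk -v m="$1" '$1 == m {print $2}'
}

# Hypothetical sample of what a scrape of an exporter might return
value=$(metric_value node_memory_MemFree_bytes <<'EOF'
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1234567168
node_load1 0.42
EOF
)
echo "MemFree: ${value} bytes"   # prints "MemFree: 1234567168 bytes"
```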
Notice

If you have configured Kubernetes Engine on a GPU Node, please check GPU metrics through the metrics provided by Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, it may not work properly.
Reference
The ServiceWatch Agent guide for collecting GPU metrics on a GPU Node can be used the same as on a GPU Server. For more details, see GPU Server > ServiceWatch Agent.

3 - Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster service scope

If an IaaS HW-level issue occurs with the Multi-node GPU Cluster service, technical support is available through the Support Center's Contact Us. However, risks from changes such as OS kernel updates or application installation are the responsibility of the user, and technical support for them may be difficult, so be cautious when performing system updates or similar tasks.

IaaS HW level problem

  • A server-internal HW fault event message appears in the IPMI (iLO) HW monitoring console
  • A GPU HW operation error is confirmed with the nvidia-smi command
  • HW error messages occur during InfiniBand HCA card or InfiniBand Switch inspection
Caution
The Multi-node GPU Cluster service is sensitive to software version compatibility among Ubuntu OS, NVIDIA, and InfiniBand components, so official technical support is not available after changes such as a user-performed OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.

OS Kernel and Package Holding

Notice
  • If you do not want automatic updates of package versions, it is recommended to block package updates using the apt-mark command.
  • It is recommended to block the update of Linux kernel or IB related package versions.

To proceed with OS Kernel and Package holding, follow the procedure below.

  1. Use the following commands to check the versions of the kernel and IB-related packages.
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
    ii  crash                                 7.2.8-1ubuntu1.20.04.1                  amd64        kernel debugging utility, allowing gdb like syntax
    ii  dkms                                  2.8.1-5ubuntu2                          all          Dynamic Kernel Module Support Framework
    ii  dmeventd                              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event daemon
    ii  dmsetup                               2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  iser-dkms                             5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo iser kernel modules
    ii  isert-dkms                            5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo isert kernel modules
    ii  kernel-mft-dkms                       4.17.2-12                               all          DKMS support for kernel-mft kernel modules
    ii  kmod                                  27-1ubuntu2                             amd64        tools for managing Linux kernel modules
    ii  knem                                  1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
    ii  knem-dkms                             1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
    ii  libaio1:amd64                         0.3.112-5                               amd64        Linux kernel AIO access library - shared library
    ii  libdevmapper-event1.02.1:amd64        2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event support library
    ii  libdevmapper1.02.1:amd64              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  libdrm-amdgpu1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
    ii  libdrm-common                         2.4.107-8ubuntu1~20.04.2                all          Userspace interface to kernel DRM services -- common files
    ii  libdrm-intel1:amd64                   2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to intel-specific kernel DRM services -- runtime
    ii  libdrm-nouveau2:amd64                 2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to nouveau-specific kernel DRM services -- runtime
    ii  libdrm-radeon1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to radeon-specific kernel DRM services -- runtime
    ii  libdrm2:amd64                         2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to kernel DRM services -- runtime
    ii  linux-firmware                        1.187.29                                all          Firmware for Linux kernel drivers
    hi  linux-generic                         5.4.0.105.109                           amd64        Complete Generic Linux kernel and headers
    ii  linux-headers-5.4.0-104               5.4.0-104.118                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    ii  linux-headers-5.4.0-105               5.4.0-105.119                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    hi  linux-headers-generic                 5.4.0.105.109                           amd64        Generic Linux kernel headers
    ii  linux-image-5.4.0-104-generic         5.4.0-104.118                           amd64        Signed kernel image generic
    ii  linux-image-5.4.0-105-generic         5.4.0-105.119                           amd64        Signed kernel image generic
    hi  linux-image-generic                   5.4.0.105.109                           amd64        Generic Linux kernel image
    ii  linux-libc-dev:amd64                  5.4.0-105.119                           amd64        Linux Kernel Headers for development
    ii  linux-modules-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-104-generic 5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-105-generic 5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  mlnx-ofed-kernel-dkms                 5.4-OFED.5.4.3.0.3.1                    all          DKMS support for mlnx-ofed kernel modules
    ii  mlnx-ofed-kernel-utils                5.4-OFED.5.4.3.0.3.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
    ii  mlnx-tools                            5.2.0-0.54303                           amd64        Userspace tools to restart and tune MLNX_OFED kernel modules
    ii  nvidia-kernel-common-470              470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
    ii  nvidia-kernel-source-470              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
    ii  nvidia-peer-memory                    1.2-0                                   all          nvidia peer memory kernel module.
    ii  nvidia-peer-memory-dkms               1.2-0                                   all          DKMS support for nvidia-peer-memory kernel modules
    ii  rsyslog                               8.2001.0-1ubuntu1.1                     amd64        reliable system and kernel logging daemon
    ii  srp-dkms                              5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo srp kernel modules
    ii  mlnx-ofed-kernel-dkms                 5.4-OFED.5.4.3.0.3.1                    all          DKMS support for mlnx-ofed kernel modules
    ii  mlnx-ofed-kernel-utils                5.4-OFED.5.4.3.0.3.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
    ii  mlnx-tools                            5.2.0-0.54303                           amd64        Userspace tools to restart and tune MLNX_OFED kernel modules
    ii  nvidia-kernel-common-470              470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
    ii  nvidia-kernel-source-470              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
    ii  nvidia-peer-memory                    1.2-0                                   all          nvidia peer memory kernel module.
    ii  nvidia-peer-memory-dkms               1.2-0                                   all          DKMS support for nvidia-peer-memory kernel modules
    ii  rsyslog                               8.2001.0-1ubuntu1.1                     amd64        reliable system and kernel logging daemon
    ii  srp-dkms                              5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo srp kernel modules
    Code block. Kernel, IB related package version check
  2. Use the apt-mark command to hold package updates.
    # apt-mark hold <package name>
    Code block. Package update hold
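When the list above is long, the hold commands can be generated rather than typed one by one. The sketch below derives them from `dpkg -l` output; the `hold_cmds` helper and its name patterns are our own examples drawn from the listing above, so adjust them to match the packages actually installed on your image.

```shell
# Generate `apt-mark hold` commands for every installed (ii/hi) package
# whose name matches the kernel/IB package families listed above.
# The pattern list is an example; extend it to cover your image.
hold_cmds() {
    awk '$1 ~ /^(ii|hi)$/ && $2 ~ /^(linux-|mlnx-|nvidia-|knem|kernel-mft|srp-dkms)/ {
        print "apt-mark hold " $2
    }'
}
# Review the output first, then apply it on the node:
#   dpkg -l | hold_cmds | sudo sh
```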

Intel E810 Driver Update

Check the version of the Intel E810 driver and update it to the recommended version.

The driver update method is as follows.

  1. Move the driver tar file to the desired directory (for example, /home/username/ice or /usr/local/src/ice), then untar the archive.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      Code block. Unzip file
  2. Change to the driver src directory.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Code block. Directory change
  3. Compile the driver module.

    make install
    Code block. Driver module compile

  4. After the update is complete, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Code block. Version check
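Taken together, the steps above can be sketched as one sequence. The script below only prints the planned commands by default, since the directory and the x.x.x version are the placeholders from the text rather than real values; set DRY_RUN=0 on the node to execute it.

```shell
# Steps 1-4 above as one sequence. The directory and the x.x.x version
# are the placeholders from the text, not real values. The script only
# prints the plan by default; set DRY_RUN=0 on the node to execute it.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
run cd /usr/local/src/ice      # directory holding the driver tarball
run tar zxf ice-x.x.x.tar.gz   # untar the driver source
run cd ice-x.x.x/src           # change to the driver src directory
run make install               # compile and install the ice module
run modinfo ice                # confirm the updated version afterwards
```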

NVIDIA driver check


To check the NVIDIA driver status (nvidia-smi topo, IB nv_peer_mem) and inspect the IaaS HW level, follow the procedure below.

  1. Check the GPU driver and HW status.

    user@bm-dev-001:~$ nvidia-smi topo -m
            GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
    GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     80-95   5
    GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     80-95   5
    mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
    mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
    mlx5_2  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS
    mlx5_3  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X
    
    Legend:
    
      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing at most a single PCIe bridge
      NV#  = Connection traversing a bonded set of # NVLinks
    Code block. GPU driver and HW status check
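On this server type, every off-diagonal GPU-to-GPU entry in the matrix above should read NV12. A quick way to verify that without reading the matrix by eye is sketched below; the `nvlink_ok_rows` helper and the expectation of 8 healthy rows are our own assumptions for an 8-GPU node.

```shell
# Count GPU rows of `nvidia-smi topo -m` output (read from stdin) whose
# entries toward the other GPUs are all NV12. On a healthy 8-GPU node
# the printed count should be 8.
nvlink_ok_rows() {
    awk '$1 ~ /^GPU[0-7]$/ {
        ok = 1
        for (i = 2; i <= 9; i++)        # columns GPU0..GPU7 of the matrix
            if ($i != "NV12" && $i != "X") ok = 0
        if (ok) n++
    } END { print n + 0 }'
}
# Usage on the node:  nvidia-smi topo -m | nvlink_ok_rows
```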

  2. Check the NVSwitch HW status.

    user@bm-dev-001:~$ nvidia-smi nvlink --status
    GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    Code block. NVSwitch HW status check
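Each of the 8 GPUs above reports 12 links at 25 GB/s, i.e. 96 speed lines in total. Counting them is a cheap health check; the `active_links` helper and the expected total of 96 are our own assumptions for this platform.

```shell
# Count NVLink lanes reported at 25 GB/s in `nvidia-smi nvlink --status`
# output read from stdin. 8 GPUs x 12 links should give 96 here; a link
# that is down does not report this speed and is therefore not counted.
active_links() { grep -c ': 25 GB/s'; }
# Usage on the node:  nvidia-smi nvlink --status | active_links
```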

  3. Check the InfiniBand(IB) HCA card HW status and Link.

    user@bm-dev-001:~$ ibdev2netdev -v
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    
    
    root@bm-dev-001:~# ibstat
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Code block. InfiniBand (IB) HCA card HW status and Link check
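All four HCAs above should report State: Active. The sketch below tallies that from ibstat output; the `ib_ports_active` helper name and the 4/4 expectation are our own.

```shell
# Summarise ibstat output read from stdin as "<active>/<total> ports
# active". On this node the expected result is "4/4 ports active".
ib_ports_active() {
    awk '/^ *State:/        { total++ }
         /^ *State: Active/ { up++ }
         END { print up + 0 "/" total + 0 " ports active" }'
}
# Usage on the node:  ibstat | ib_ports_active
```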

IB bandwidth communication check

To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS HW level, follow these steps.

  1. Check the name of the IB HCA interface.

    user@bm-dev-001:~$ ibdev2netdev
    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code block. IB HCA interface name check

  2. Check the HCA interface that can communicate with IB Switch#1.

    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    Code block. HCA interface check

  3. Check the HCA interface that can communicate with IB Switch#2.

    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code block. HCA interface check

  4. Run the server-side command to check the communication status.

    • The client-side command is entered second to establish mutual communication.
      user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
      ************************************
      * Waiting for client to connect... *
      ************************************
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       RX depth        : 512
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x07 QPN 0x002e PSN 0xa86622
       remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             0.00               19827.40                   0.317238
      ---------------------------------------------------------------------------------------
       Code block. Communication status check
  5. Run the client-side command to check the communication status.

    • The server-side command is entered first to establish mutual communication.
      root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 128
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x0a QPN 0x002a PSN 0x98a48e
       remote address: LID 0x07 QPN 0x002c PSN 0xe68304
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             19008.49            19006.37                  0.304102
      ---------------------------------------------------------------------------------------
       Code block. Communication status check
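The 65536-byte row in the results above is the figure worth watching. The sketch below extracts the BW average column and compares it against a floor; the `bw_check` helper and the 18000 MB/sec default threshold are our own examples, not vendor figures.

```shell
# Extract the "BW average[MB/sec]" column from the 65536-byte result row
# of ib_send_bw output (stdin) and compare it to a minimum in MB/sec.
# The 18000 default is an illustrative floor, not a vendor figure.
bw_check() {
    awk -v min="${1:-18000}" '
        $1 == "65536" { bw = $4 }
        END {
            if (bw + 0 >= min) print "PASS avg " bw " MB/sec"
            else               print "FAIL avg " bw " MB/sec"
        }'
}
# Usage on either side:  ib_send_bw ... | bw_check
```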

Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.

user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset
Code block. IB service-related kernel module check (1)
user@bm-dev-001:~$ service nv_peer_mem status
  nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
     Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
     Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Code block. IB service-related kernel module check (2)
user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi   110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Code block. IB service-related kernel module check(3)
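
The module check above can also be scripted. The following is a minimal sketch (the helper name `check_ib_modules` is hypothetical, and the expected-module list is taken from the sample `lsmod` output above) that reads `lsmod` output on stdin and reports any missing IB-related modules:

```shell
# check_ib_modules: verify that the IB-related kernel modules shown in the
# sample output above appear in an `lsmod` listing read on stdin.
# Minimal sketch; the expected-module list comes from the sample output.
check_ib_modules() {
    input=$(cat)
    missing=""
    for mod in ib_core ib_ipoib ib_uverbs mlx5_ib mlx5_core; do
        # lsmod prints the module name in the first column
        echo "$input" | grep -q "^${mod} " || missing="$missing $mod"
    done
    if [ -z "$missing" ]; then
        echo "OK: all expected IB modules are loaded"
    else
        echo "WARN: missing modules:$missing"
    fi
}
# Typical use on a live node:
#   lsmod | check_ib_modules
```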

Storage Physical Disk Resources and Multi-Path Check

Check the IaaS hardware level by verifying the storage physical disk resources and the Multi-Path configuration.

root@bm-dev-002:/tmp# fdisk -l
Code block. Storage physical disk resource check result
root@bm-dev-002:/tmp# multipath -ll
Code Block. Multi-Path Verification Result
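
The Multi-Path check above can be summarized with a short helper. This is a minimal sketch (the helper name `count_mpath_devices` is hypothetical) that counts device maps in `multipath -ll` output read on stdin, assuming the usual layout in which each device-map header line names the underlying `dm-` device:

```shell
# count_mpath_devices: count device maps in `multipath -ll` output read on
# stdin. Minimal sketch; it assumes each device-map header line contains
# the underlying "dm-" device name, the usual multipath -ll layout.
count_mpath_devices() {
    n=$(grep -c ' dm-' || true)   # || true: grep exits 1 when no match
    echo "multipath device maps found: $n"
}
# Typical use on a live node:
#   multipath -ll | count_mpath_devices
```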

Checking the Service Network after a new Multi-node GPU Cluster deployment

Use the following command to check whether the MII Status of the Bonding and Slave Interfaces is up.

  • command

    root@mngc-001:~# cat /proc/net/bonding/bond-srv
    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Code Block. Service Network Check Command

  • confirmation result

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
    Code Block. Service Network Check Command Result

Reference
If any Slave Interface is in a down state, report the issue through the Support Center's Contact Us so that it can be addressed.
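
The manual check above can be automated. The following is a minimal sketch (the helper name `check_bond` is hypothetical; `bond-srv` is the bond name used in the example above) that reads a bonding status dump on stdin and flags any `MII Status: down` line:

```shell
# check_bond: report whether any "MII Status" line in a bonding status
# dump read on stdin is down. Minimal sketch.
check_bond() {
    down=$(grep -c 'MII Status: down' || true)  # grep exits 1 on no match
    if [ "$down" -eq 0 ]; then
        echo "OK: all MII statuses are up"
    else
        echo "WARN: $down MII status line(s) report down"
    fi
}
# Typical use on a live node:
#   check_bond < /proc/net/bonding/bond-srv
```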

Checking the Time Server and time synchronization after a new Multi-node GPU Cluster deployment

The OS image has the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check whether a line marked ^* appears in the MS column.

  • command

    Color mode
    root@mngc-001:~# chronyc sources -v
    Code Block. chrony time synchronization check command

  • confirmation result

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
    ^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms
    Code block. chrony time synchronization check result
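
The synchronization check above can also be scripted. This is a minimal sketch (the helper name `check_ntp_sync` is hypothetical) that reads `chronyc sources` output on stdin and verifies that one source is marked ^* (selected and synchronized):

```shell
# check_ntp_sync: verify that `chronyc sources` output read on stdin
# contains a source marked ^* in the MS column. Minimal sketch.
check_ntp_sync() {
    if grep -q '^\^\*'; then
        echo "OK: a synchronized source (^*) is present"
    else
        echo "WARN: no source is marked ^*"
    fi
}
# Typical use on a live node:
#   chronyc sources | check_ntp_sync
```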

GPU MIG/ECC Setting Initialization Check Guide

When a Multi-node GPU Cluster is provisioned, the GPU MIG/ECC settings are initialized. However, for the setting values to take effect correctly, reboot the server once after initial delivery, and then follow the check guide below to verify that the setting values have been applied.

Reference
  • MIG: Multi-Instance GPU
  • ECC: Error Correction Code

MIG Setup Initialization

Refer to the following for how to check and initialize MIG settings.

Use the following command to check whether the MIG M. status value is Disabled.

  • command

    root@bm-dev-001:~# nvidia-smi
    Code Block. MIG M. Setting Check Command

  • confirmation result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    1  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    58W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code Block. MIG M. Initialization Setting Check Result

  • If the MIG M. status value is not Disabled, use the following commands to initialize MIG.

    root@bm-dev-001:~# nvidia-smi -mig 0
    root@bm-dev-001:~# nvidia-smi --gpu-reset
    Code Block. MIG M. Status Value Initialization
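
The MIG check can also be done in machine-readable form. This is a minimal sketch, assuming the `mig.mode.current` query field of `nvidia-smi` (the helper name `check_mig_disabled` is hypothetical), that reads the query output on stdin:

```shell
# check_mig_disabled: report GPUs whose current MIG mode is not Disabled.
# Reads `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader`
# output on stdin. Minimal sketch; assumes one "index, mode" line per GPU.
check_mig_disabled() {
    bad=$(grep -vc 'Disabled' || true)  # count lines not saying Disabled
    if [ "$bad" -eq 0 ]; then
        echo "OK: MIG is Disabled on all GPUs"
    else
        echo "WARN: $bad GPU(s) do not report MIG Disabled"
    fi
}
# Typical use on a live node:
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | check_mig_disabled
```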

ECC Setting Initialization

Refer to the following for how to check and initialize the ECC settings.

Use the following command to check whether the Volatile Uncorr. ECC status value is Off.

  • command

    root@bm-dev-001:~# nvidia-smi
    Code Block. ECC Setting Check Command

  • confirmation result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    1  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    61W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code Block. ECC Setting Check Result

  • If the Volatile Uncorr. ECC status value is On*, reboot the server.

  • If the Volatile Uncorr. ECC status value is neither On* nor Off, use the following command to initialize ECC. After initialization, reboot and check that the status value is Off.

root@bm-dev-001:~# nvidia-smi --ecc-config=0
Code Block. ECC Setting Initialization Command
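
The post-reboot ECC check can likewise be scripted. This is a minimal sketch, assuming the `ecc.mode.current` query field of `nvidia-smi` (the helper name `check_ecc_disabled` is hypothetical); the query field reports Enabled/Disabled, which corresponds to On/Off in the Volatile Uncorr. ECC column above:

```shell
# check_ecc_disabled: report GPUs whose current ECC mode is not Disabled.
# Reads `nvidia-smi --query-gpu=index,ecc.mode.current --format=csv,noheader`
# output on stdin. Minimal sketch; assumes one "index, mode" line per GPU.
check_ecc_disabled() {
    bad=$(grep -vc 'Disabled' || true)  # count lines not saying Disabled
    if [ "$bad" -eq 0 ]; then
        echo "OK: ECC is Disabled (Off) on all GPUs"
    else
        echo "WARN: $bad GPU(s) still report ECC Enabled"
    fi
}
# Typical use on a live node, after initialization and reboot:
#   nvidia-smi --query-gpu=index,ecc.mode.current --format=csv,noheader | check_ecc_disabled
```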