The page has been translated by Gen AI.

Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster service scope

If an IaaS HW-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installation are the user's responsibility, and technical support may be limited in those cases. Be cautious when performing system updates or similar tasks.

IaaS HW level problem

  • Server-internal HW fault events reported in the IPMI (iLO) HW monitoring console
  • GPU HW operation errors confirmed with the nvidia-smi command
  • HW error messages found when inspecting the InfiniBand HCA card or InfiniBand switch
Caution
Multi-node GPU Cluster is sensitive to software version compatibility among the Ubuntu OS, NVIDIA, and InfiniBand components, so official technical support is not available after the user makes changes such as an OS kernel update or application installation.

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.

OS Kernel and Package Holding

Notice
  • If you do not want automatic updates of package versions, it is recommended to block package updates using the apt-mark command.
  • It is recommended to block the update of Linux kernel or IB related package versions.

To proceed with OS Kernel and Package holding, follow the procedure below.

  1. Use the following command to check the version of the kernel and IB-related packages.
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
    ii  crash                                 7.2.8-1ubuntu1.20.04.1                  amd64        kernel debugging utility, allowing gdb like syntax
    ii  dkms                                  2.8.1-5ubuntu2                          all          Dynamic Kernel Module Support Framework
    ii  dmeventd                              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event daemon
    ii  dmsetup                               2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  iser-dkms                             5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo iser kernel modules
    ii  isert-dkms                            5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo isert kernel modules
    ii  kernel-mft-dkms                       4.17.2-12                               all          DKMS support for kernel-mft kernel modules
    ii  kmod                                  27-1ubuntu2                             amd64        tools for managing Linux kernel modules
    ii  knem                                  1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
    ii  knem-dkms                             1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
    ii  libaio1:amd64                         0.3.112-5                               amd64        Linux kernel AIO access library - shared library
    ii  libdevmapper-event1.02.1:amd64        2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event support library
    ii  libdevmapper1.02.1:amd64              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  libdrm-amdgpu1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
    ii  libdrm-common                         2.4.107-8ubuntu1~20.04.2                all          Userspace interface to kernel DRM services -- common files
    ii  libdrm-intel1:amd64                   2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to intel-specific kernel DRM services -- runtime
    ii  libdrm-nouveau2:amd64                 2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to nouveau-specific kernel DRM services -- runtime
    ii  libdrm-radeon1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to radeon-specific kernel DRM services -- runtime
    ii  libdrm2:amd64                         2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to kernel DRM services -- runtime
    ii  linux-firmware                        1.187.29                                all          Firmware for Linux kernel drivers
    hi  linux-generic                         5.4.0.105.109                           amd64        Complete Generic Linux kernel and headers
    ii  linux-headers-5.4.0-104               5.4.0-104.118                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    ii  linux-headers-5.4.0-105               5.4.0-105.119                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    hi  linux-headers-generic                 5.4.0.105.109                           amd64        Generic Linux kernel headers
    ii  linux-image-5.4.0-104-generic         5.4.0-104.118                           amd64        Signed kernel image generic
    ii  linux-image-5.4.0-105-generic         5.4.0-105.119                           amd64        Signed kernel image generic
    hi  linux-image-generic                   5.4.0.105.109                           amd64        Generic Linux kernel image
    ii  linux-libc-dev:amd64                  5.4.0-105.119                           amd64        Linux Kernel Headers for development
    ii  linux-modules-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-104-generic 5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-105-generic 5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  mlnx-ofed-kernel-dkms                 5.4-OFED.5.4.3.0.3.1                    all          DKMS support for mlnx-ofed kernel modules
    ii  mlnx-ofed-kernel-utils                5.4-OFED.5.4.3.0.3.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
    ii  mlnx-tools                            5.2.0-0.54303                           amd64        Userspace tools to restart and tune MLNX_OFED kernel modules
    ii  nvidia-kernel-common-470              470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
    ii  nvidia-kernel-source-470              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
    ii  nvidia-peer-memory                    1.2-0                                   all          nvidia peer memory kernel module.
    ii  nvidia-peer-memory-dkms               1.2-0                                   all          DKMS support for nvidia-peer-memory kernel modules
    ii  rsyslog                               8.2001.0-1ubuntu1.1                     amd64        reliable system and kernel logging daemon
    ii  srp-dkms                              5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo srp kernel modules
    Code block. Kernel, IB related package version check
  2. Use the apt-mark command to hold the package update.
    # apt-mark hold <package name>
    Code block. Package update hold
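
The check and hold steps can be combined into a small helper. The following is a minimal sketch, not an official procedure: the function name list_hold_candidates and the default name pattern (kernel image/headers/modules, mlnx-, nvidia-) are assumptions made for illustration, so review the printed list before holding anything. Lines with status "hi" in the dpkg listing are already on hold and are skipped.

```shell
#!/bin/sh
# list_hold_candidates [PATTERN]
# Reads `dpkg -l` output on stdin and prints the names of installed
# packages (status "ii") whose name matches PATTERN (an awk regex).
# The default pattern is an assumption; adjust it to your image.
list_hold_candidates() {
    default_pat='^(linux-(image|headers|modules)|mlnx-|nvidia-)'
    pattern="${1:-$default_pat}"
    awk -v pat="$pattern" '$1 == "ii" && $2 ~ pat { print $2 }'
}

# Dry run, shows what would be held:
#   dpkg -l | list_hold_candidates
# Hold the reviewed list (as root):
#   dpkg -l | list_hold_candidates | xargs -r apt-mark hold
# Verify which packages are on hold:
#   apt-mark showhold
```

Keeping the filter separate from the apt-mark call makes it possible to inspect the candidate list first; apt-mark unhold reverses a hold if a package was included by mistake.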

Intel E810 Driver Update

Check the version of the Intel E810 driver and update it to the recommended version.

The driver update method is as follows.

  1. Move the base driver tar file to a directory of your choice.
     Example: /home/username/ice or /usr/local/src/ice
  2. Extract the archive file.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      Code block. Unzip file
  3. Change to the driver src directory.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Code block. Directory change
  4. Compile and install the driver module.

    make install
    Code Block. Driver Module Compile

  5. After the update is complete, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Code Block. Version Check
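
Because grep version also matches the srcversion line of the modinfo output, the version check can be made stricter. The following is a minimal sketch under stated assumptions: the function names module_version and check_module_version are hypothetical, and x.x.x stands for whatever driver version you installed.

```shell
#!/bin/sh
# module_version: read `modinfo <module>` output on stdin and print only
# the value of the "version:" field (the "srcversion:" line is ignored).
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# check_module_version EXPECTED
# Succeeds only if the modinfo output on stdin reports EXPECTED.
check_module_version() {
    found=$(module_version)
    if [ "$found" = "$1" ]; then
        echo "OK: version $found"
    else
        echo "MISMATCH: found '${found:-none}', expected '$1'"
        return 1
    fi
}

# Usage (x.x.x is the version of the driver tar file you installed):
#   modinfo ice | check_module_version x.x.x
```

Note that modinfo reads the module file on disk, so it may already report the new version while the previously loaded module is still running; whether a module reload or a reboot is needed should be confirmed against the driver's README.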

NVIDIA driver check

Note
nvidia-smi topo, IB nv_peer_mem status check

To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS HW level, follow the procedure below.

  1. Check the GPU driver and HW status.

    user@bm-dev-001:~$ nvidia-smi topo -m
            GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
    GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     80-95   5
    GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     80-95   5
    mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
    mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
    mlx5_2  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS
    mlx5_3  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X
    
    Legend:
    
      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing at most a single PCIe bridge
      NV#  = Connection traversing a bonded set of # NVLinks
    Code Block. GPU Driver and HW Status Check

  2. Check the NVSwitch HW status.

    user@bm-dev-001:~$ nvidia-smi nvlink --status
    GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    Code block. NVSwitch HW status check

  3. Check the InfiniBand(IB) HCA card HW status and Link.

    user@bm-dev-001:~$ ibdev2netdev -v
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    
    
    root@bm-dev-001:~# ibstat
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Code block. InfiniBand(IB) HCA card HW status and Link check
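
Reading the State and Physical state of every port by eye is error-prone on nodes with several HCAs. The sketch below is one possible way to automate it, assuming the `ibstat` output layout shown above. The function name `check_ib_ports` is hypothetical and is not part of the service image.

```shell
# check_ib_ports: read `ibstat` output on stdin and print one line for each
# CA whose port is not "Active" / "LinkUp". Returns 1 if any such port exists.
# Hypothetical helper; assumes the ibstat text layout shown in this guide.
check_ib_ports() {
    awk -v q="'" '
        # CA header line, e.g.  CA <quote>mlx5_0<quote>  ("CA type:" is skipped)
        $1 == "CA" && index($2, q) == 1 { ca = $2; gsub(q, "", ca) }
        # Logical port state
        $1 == "State:" { state = $2 }
        # Physical state follows State, so evaluate the pair here
        $1 == "Physical" && $2 == "state:" {
            if (state != "Active" || $3 != "LinkUp") {
                printf "%s: State=%s Physical=%s\n", ca, state, $3
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

A typical call would be `ibstat | check_ib_ports && echo "all IB ports healthy"`. No output and a zero exit status mean every listed port reported Active and LinkUp.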

IB bandwidth communication check

To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS HW level, follow these steps.

  1. Check the name of the IB HCA interface.

    user@bm-dev-001:~$ ibdev2netdev
    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code block. Check the name of IB HCA interface

  2. Check the HCA interface that can communicate with IB Switch#1.

    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    Code Block. HCA Interface Check

  3. Check the HCA interface that can communicate with IB Switch#2.

    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code Block. HCA Interface Check

  4. Run the SERVER side command first to check the communication status.

    • The server waits for a connection; the CLIENT side command (step 5) is entered second to complete the exchange.
      user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
      ************************************
      * Waiting for client to connect... *
      ************************************
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       RX depth        : 512
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x07 QPN 0x002e PSN 0xa86622
       remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             0.00               19827.40                   0.317238
      ---------------------------------------------------------------------------------------
      Code Block. Communication Status Check
  5. Run the CLIENT side command to check the communication status.

    • The SERVER side command (step 4) must already be running before the client command is entered.
      root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 128
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x0a QPN 0x002a PSN 0x98a48e
       remote address: LID 0x07 QPN 0x002c PSN 0xe68304
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             19008.49            19006.37                  0.304102
      ---------------------------------------------------------------------------------------
      Code Block. Communication Status Check
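
To compare runs across HCAs or nodes, it can help to pull the average bandwidth out of the `ib_send_bw` report instead of reading the table by hand. The following is a minimal sketch, assuming the five-column result row shown above. The function names `bw_average` and `bw_at_least` are hypothetical, and any threshold passed to `bw_at_least` is your own choice, not a figure guaranteed by the service.

```shell
# bw_average: read ib_send_bw output on stdin and print the
# "BW average[MB/sec]" value from the first result row after the header.
bw_average() {
    awk '
        $1 == "#bytes" { header = 1; next }          # column header line
        header && NF == 5 && $1 ~ /^[0-9]+$/ {       # numeric result row
            print $4                                 # 4th column = BW average
            exit
        }
    '
}

# bw_at_least MIN: succeed if the average bandwidth read from stdin is
# at least MIN MB/sec. Returns 2 if no result row was found.
bw_at_least() {
    min=$1
    avg=$(bw_average)
    [ -n "$avg" ] || return 2
    awk -v a="$avg" -v m="$min" 'BEGIN { exit !(a + 0 >= m + 0) }'
}
```

For example, on the client side: `ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP> | bw_at_least 19000 || echo "bandwidth below expectation"`. The value 19000 here only mirrors the sample output in this guide.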

Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.

user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset
Code block. IB service related kernel module check(1)
user@bm-dev-001:~$ service nv_peer_mem status
  nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
     Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
     Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Code block. IB service-related kernel module check(2)
user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi   110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Code block. IB service-related kernel module check(3)
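
Because `grep ib` also matches unrelated modules such as libiscsi, a check against an explicit list of module names may be easier to read. The sketch below assumes the standard `lsmod` layout with the module name in the first column. The function name `missing_modules` is hypothetical, and the module list in the usage line is taken from the sample output above rather than from an official requirement list.

```shell
# missing_modules MOD...: read `lsmod` output on stdin and print
# "missing: <name>" for every requested module that is not loaded.
# Returns 1 if at least one module is missing, 0 otherwise.
missing_modules() {
    # First column of each line is the module name; skip the header row.
    loaded=$(awk '$1 != "Module" { print $1 }')
    missing=0
    for m in "$@"; do
        # -x: whole-line match, -F: fixed string, -q: quiet
        if ! printf '%s\n' "$loaded" | grep -qxF -- "$m"; then
            echo "missing: $m"
            missing=1
        fi
    done
    return $missing
}
```

Usage: `lsmod | missing_modules nv_peer_mem ib_core mlx5_core mlx5_ib ib_ipoib ib_umad ib_uverbs`.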

Storage Physical Disk Resources and Multi-Path Check

Check the IaaS HW level by checking the storage physical disk resources and Multi-Path.

root@bm-dev-002:/tmp# fdisk -l
Code block. Storage physical disk resource check result
root@bm-dev-002:/tmp# multipath -ll
Code Block. Multi-Path Verification Result

Checking the Service Network after a new Multi-node GPU Cluster deployment

Use the following command to check that the MII Status of the bond and of each Slave Interface is up.

  • command

    root@mngc-001:~# cat /proc/net/bonding/bond-srv
    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Code Block. Service Network Check Command

  • confirmation result

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
    Code Block. Service Network Check Command Result

Reference
If any Slave Interface is in a down state, report it through the Support Center’s Contact Us so that action can be taken.
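
The MII Status check can also be scripted. The following sketch assumes the `/proc/net/bonding/<bond>` text layout shown above, in which the first MII Status line belongs to the bond itself and each later one follows a Slave Interface line. The function name `check_bond` is hypothetical.

```shell
# check_bond: read a /proc/net/bonding/<bond> file on stdin and print one
# line for the bond or any slave whose "MII Status" is not "up".
# Returns 1 if anything is down, 0 otherwise.
check_bond() {
    awk '
        # Remember which slave the following status lines belong to
        $1 == "Slave" && $2 == "Interface:" { name = $3 }
        # "MII Polling Interval" is skipped because $2 is not "Status:"
        $1 == "MII" && $2 == "Status:" {
            who = (name == "") ? "bond" : name
            if ($3 != "up") { print who ": MII Status " $3; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage: `check_bond < /proc/net/bonding/bond-srv`. Any line it prints names an interface worth reporting.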

Checking the Time Server and time synchronization after a new Multi-node GPU Cluster deployment

The OS image has the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check that one line is marked with ^* in the MS column, which indicates the source chrony is currently synchronized to.

  • command

    root@mngc-001:~# chronyc sources -V
    Code Block. Time synchronization check command

  • confirmation result

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
    ^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms
    Code block. Time synchronization check result
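
The ^* check lends itself to a one-line test. The sketch below assumes the `chronyc sources` table layout shown above, where the first field of each source row is the two-character mode/state marker. The function name `chrony_synced` is hypothetical.

```shell
# chrony_synced: read `chronyc sources` output on stdin, print the address
# of the source marked "^*" (currently selected), and return 0.
# Returns 1 if no source is marked "^*".
chrony_synced() {
    awk '
        substr($1, 1, 2) == "^*" { print $2; found = 1; exit }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `chronyc sources | chrony_synced || echo "no synchronized time source"`.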

GPU MIG/ECC Setting Initialization Check Guide

When you apply for the Multi-node GPU Cluster product, the GPU MIG/ECC settings are initialized. For the settings to take effect, reboot the server once before first use, then follow the inspection guide below to confirm that the settings have been applied.

Reference
  • MIG: Multi-Instance GPU
  • ECC: Error Correction Code

MIG Setup Initialization

Refer to the following for how to check and initialize MIG settings.

Use the following command to check if the status value of MIG M is Disabled.

  • command

    root@bm-dev-001:~# nvidia-smi
    Code Block. MIG M. Initialize Settings

  • confirmation result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    58W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code Block. MIG M. Initialization Setting Check Result

  • If MIG M.’s status value is not Disabled, use the following command to initialize MIG.

    root@bm-dev-001:~# nvidia-smi -mig 0
    root@bm-dev-001:~# nvidia-smi --gpu-reset
    Code Block. MIG M. Status Value Initialization

ECC Setting Initialization

Refer to the following for how to check and initialize the ECC settings.

Use the following command to check if the status value of Volatile Uncorr. ECC is Off.

  • command

    root@bm-dev-001:~# nvidia-smi
    Code Block. ECC Setting Command

  • confirmation result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    61W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code Block. ECC Setting Check Result

  • If the Volatile Uncorr. ECC status value is On*, reboot the server.

  • If the Volatile Uncorr. ECC status value is neither On* nor Off, use the following command to initialize ECC. After initialization, reboot and check that the status value is Off.

root@bm-dev-001:~# nvidia-smi --ecc-config=0
Code Block. ECC Status Value Check
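
On nodes with many GPUs, the MIG and ECC checks above can be combined into one pass using the CSV query mode of `nvidia-smi`. This is a sketch under assumptions: it relies on the `mig.mode.current` and `ecc.mode.current` query fields and assumes they report the words Enabled or Disabled, which should be verified against the installed driver version. The function name `check_mig_ecc` is hypothetical.

```shell
# check_mig_ecc: read CSV rows of "index, mig.mode.current, ecc.mode.current"
# on stdin and print one line per GPU setting that is not "Disabled".
# Returns 1 if any GPU still has MIG or ECC enabled, 0 otherwise.
# Assumption: nvidia-smi reports these fields as Enabled / Disabled.
check_mig_ecc() {
    awk -F', *' '
        NF == 0 { next }                                  # skip blank lines
        $2 != "Disabled" { print "GPU " $1 ": MIG " $2; bad = 1 }
        $3 != "Disabled" { print "GPU " $1 ": ECC " $3; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage: `nvidia-smi --query-gpu=index,mig.mode.current,ecc.mode.current --format=csv,noheader | check_mig_ecc`. Any GPU it lists would then be handled with the initialization commands described above.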