Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster service scope
If an IaaS HW-level issue occurs with the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installation are the user's responsibility, and technical support may be difficult in those cases. Be cautious when performing system updates or similar tasks.
IaaS HW level problem
- Fault event messages for the server's internal HW reported in the IPMI (iLO) HW monitoring console
- GPU HW operation errors confirmed with the nvidia-smi command
- HW error messages found when inspecting the InfiniBand HCA card or InfiniBand Switch
IaaS HW Inspection Guide
After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.
OS Kernel and Package Holding
- If you do not want package versions to be updated automatically, it is recommended to block package updates using the apt-mark command.
- It is recommended to block updates of the Linux kernel and IB-related package versions.
To proceed with OS Kernel and Package holding, follow the procedure below.
- Use the following command to check the version of the kernel and IB-related packages.
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
ii  crash  7.2.8-1ubuntu1.20.04.1  amd64  kernel debugging utility, allowing gdb like syntax
ii  dkms  2.8.1-5ubuntu2  all  Dynamic Kernel Module Support Framework
ii  dmeventd  2:1.02.167-1ubuntu1  amd64  Linux Kernel Device Mapper event daemon
ii  dmsetup  2:1.02.167-1ubuntu1  amd64  Linux Kernel Device Mapper userspace library
ii  iser-dkms  5.4-OFED.5.4.3.0.1.1  all  DKMS support fo iser kernel modules
ii  isert-dkms  5.4-OFED.5.4.3.0.1.1  all  DKMS support fo isert kernel modules
ii  kernel-mft-dkms  4.17.2-12  all  DKMS support for kernel-mft kernel modules
ii  kmod  27-1ubuntu2  amd64  tools for managing Linux kernel modules
ii  knem  1.1.4.90mlnx1-OFED.5.1.2.5.0.1  amd64  userspace tools for the KNEM kernel module
ii  knem-dkms  1.1.4.90mlnx1-OFED.5.1.2.5.0.1  all  DKMS support for mlnx-ofed kernel modules
ii  libaio1:amd64  0.3.112-5  amd64  Linux kernel AIO access library - shared library
ii  libdevmapper-event1.02.1:amd64  2:1.02.167-1ubuntu1  amd64  Linux Kernel Device Mapper event support library
ii  libdevmapper1.02.1:amd64  2:1.02.167-1ubuntu1  amd64  Linux Kernel Device Mapper userspace library
ii  libdrm-amdgpu1:amd64  2.4.107-8ubuntu1~20.04.2  amd64  Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-common  2.4.107-8ubuntu1~20.04.2  all  Userspace interface to kernel DRM services -- common files
ii  libdrm-intel1:amd64  2.4.107-8ubuntu1~20.04.2  amd64  Userspace interface to intel-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:amd64  2.4.107-8ubuntu1~20.04.2  amd64  Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:amd64  2.4.107-8ubuntu1~20.04.2  amd64  Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:amd64  2.4.107-8ubuntu1~20.04.2  amd64  Userspace interface to kernel DRM services -- runtime
ii  linux-firmware  1.187.29  all  Firmware for Linux kernel drivers
hi  linux-generic  5.4.0.105.109  amd64  Complete Generic Linux kernel and headers
ii  linux-headers-5.4.0-104  5.4.0-104.118  all  Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-104-generic  5.4.0-104.118  amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii  linux-headers-5.4.0-105  5.4.0-105.119  all  Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-105-generic  5.4.0-105.119  amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
hi  linux-headers-generic  5.4.0.105.109  amd64  Generic Linux kernel headers
ii  linux-image-5.4.0-104-generic  5.4.0-104.118  amd64  Signed kernel image generic
ii  linux-image-5.4.0-105-generic  5.4.0-105.119  amd64  Signed kernel image generic
hi  linux-image-generic  5.4.0.105.109  amd64  Generic Linux kernel image
ii  linux-libc-dev:amd64  5.4.0-105.119  amd64  Linux Kernel Headers for development
ii  linux-modules-5.4.0-104-generic  5.4.0-104.118  amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-5.4.0-105-generic  5.4.0-105.119  amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-104-generic  5.4.0-104.118  amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-105-generic  5.4.0-105.119  amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  mlnx-ofed-kernel-dkms  5.4-OFED.5.4.3.0.3.1  all  DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils  5.4-OFED.5.4.3.0.3.1  amd64  Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mlnx-tools  5.2.0-0.54303  amd64  Userspace tools to restart and tune MLNX_OFED kernel modules
ii  nvidia-kernel-common-470  470.103.01-0ubuntu0.20.04.1  amd64  Shared files used with the kernel module
ii  nvidia-kernel-source-470  470.103.01-0ubuntu0.20.04.1  amd64  NVIDIA kernel source package
ii  nvidia-peer-memory  1.2-0  all  nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms  1.2-0  all  DKMS support for nvidia-peer-memory kernel modules
ii  rsyslog  8.2001.0-1ubuntu1.1  amd64  reliable system and kernel logging daemon
ii  srp-dkms  5.4-OFED.5.4.3.0.1.1  all  DKMS support fo srp kernel modules
Code block. Kernel, IB related package version check
- Use the apt-mark command to hold the package update.
# apt-mark hold <package name>
Code block. Package update hold
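In the `dpkg -l` listing, an entry whose status column starts with `h` (for example `hi linux-generic`) is a package on hold. The following is a minimal sketch for listing those entries, assuming the `dpkg -l` output is piped in on standard input. The function name `list_held_packages` is a hypothetical helper introduced here for illustration and is not part of dpkg or apt.

```shell
# Hypothetical helper (not part of dpkg/apt): print the names of packages
# whose dpkg status column starts with "h" (desired state: hold).
# Reads `dpkg -l` output from standard input.
list_held_packages() {
    # $1 is the two-letter status column, $2 is the package name.
    awk '$1 ~ /^h/ && NF >= 3 { print $2 }'
}

# Example usage on a real system (not executed here):
#   dpkg -l | list_held_packages
```

`apt-mark showhold` reports the held packages directly; the helper is only useful when you already have a saved `dpkg -l` listing to inspect.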
Intel E810 Driver Update
Check the version of the Intel E810 driver and update it to the recommended version.
- Intel E810 driver version recommended by the server manufacturer: 1.15.4
- Driver (ice-1.15.4.tar.gz) download
The driver update method is as follows.
- Move the base driver tar file to the desired directory.
  Example: /home/username/ice or /usr/local/src/ice
- Untar/unzip the archive file.
  - x.x.x is the version number of the driver tar file.
  tar zxf ice-x.x.x.tar.gz
  Code block. Unzip file
- Change to the driver src directory.
  - x.x.x is the version number of the driver tar file.
  cd ice-x.x.x/src/
  Code block. Directory change
- Compile and install the driver module.
  make install
  Code block. Driver Module Compile
- After the update is complete, check the version.
  lsmod | grep ice
  modinfo ice | grep version
  Code block. Version Check
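The version check can be turned into a pass/fail test against the recommended version (1.15.4). The sketch below assumes that `modinfo ice` prints a line whose first field is exactly `version:`, which is the usual `modinfo` layout. The function name `ice_version_matches` is a hypothetical helper, not something shipped with the driver.

```shell
# Hypothetical helper: read `modinfo ice` output on standard input and
# succeed (exit 0) only if the "version:" field equals the expected
# version passed as $1. Lines such as "srcversion:" are ignored because
# the first field must be exactly "version:".
ice_version_matches() {
    expected="$1"
    awk -v want="$expected" '
        $1 == "version:" { found = $2 }
        END {
            # Append "" to force a string comparison of the two versions.
            if ((found "") == (want "")) exit 0
            exit 1
        }
    '
}

# Example usage on a real system (not executed here):
#   modinfo ice | ice_version_matches 1.15.4 && echo "ice driver version OK"
```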
NVIDIA driver check
To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS HW level, follow the procedure below.
Check the GPU driver and HW status.
user@bm-dev-001:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63           3
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63           3
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31           1
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31           1
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     PXB     112-127         7
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     PXB     112-127         7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     80-95           5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     80-95           5
mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
mlx5_2  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS
mlx5_3  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Code block. GPU Driver and HW Status Check
Check the NVSwitch HW status.
user@bm-dev-001:~$ nvidia-smi nvlink --status
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
Code block. NVSwitch HW status check
Check the InfiniBand(IB) HCA card HW status and Link.
user@bm-dev-001:~$ ibdev2netdev -v
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
root@bm-dev-001:~# ibstat
CA 'mlx5_0'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff5060ac
        System image GUID: 0x88e9a4ffff5060ac
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 8
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff5060ac
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504080
        System image GUID: 0x88e9a4ffff504080
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 5
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504080
                Link layer: InfiniBand
CA 'mlx5_2'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff505038
        System image GUID: 0x88e9a4ffff505038
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 2
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff505038
                Link layer: InfiniBand
CA 'mlx5_3'
        CA type: MT4123
        Number of ports: 1
        Firmware version: 20.29.1016
        Hardware version: 0
        Node GUID: 0x88e9a4ffff504094
        System image GUID: 0x88e9a4ffff504094
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 7
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x88e9a4ffff504094
                Link layer: InfiniBand
Code block. InfiniBand(IB) HCA card HW status and Link check
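The NVLink and IB port outputs are long, so a summary can make anomalies easier to spot. The sketch below assumes the text layouts shown in the sample outputs in this section (`Link N: 25 GB/s` lines under each `GPU N:` header, and `State:` / `Physical state:` lines under each `CA '...'` header). Both function names are hypothetical helpers introduced for illustration, and the expectation of 12 links per GPU comes from this sample system only.

```shell
# Hypothetical helper: read `nvidia-smi nvlink --status` output on stdin
# and print "<gpu index> <number of links reporting a GB/s rate>".
nvlink_links_per_gpu() {
    awk '
        /^GPU [0-9]+:/ {
            if (gpu != "") print gpu, n   # flush the previous GPU
            gpu = $2
            sub(/:$/, "", gpu)            # "0:" -> "0"
            n = 0
            next
        }
        /Link [0-9]+: [0-9.]+ GB\/s/ { n++ }
        END { if (gpu != "") print gpu, n }
    '
}

# Hypothetical helper: read `ibstat` output on stdin and print one line
# for every port whose State is not "Active" or whose Physical state is
# not "LinkUp". Prints nothing when all ports look healthy.
ib_ports_not_active() {
    awk '
        $1 == "CA" && NF == 2 {
            ca = $2
            gsub(/[^A-Za-z0-9_]/, "", ca)  # strip the surrounding quotes
        }
        $1 == "State:" && $2 != "Active" { print ca, "state=" $2 }
        $1 == "Physical" && $2 == "state:" && $3 != "LinkUp" { print ca, "physical=" $3 }
    '
}

# Example usage on a real system (not executed here):
#   nvidia-smi nvlink --status | nvlink_links_per_gpu
#   ibstat | ib_ports_not_active
```

On the sample system every GPU would be reported with 12 links and `ib_ports_not_active` would print nothing; a lower link count or any printed port is a reason to look at the full output.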
IB bandwidth communication check
To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS HW level, follow these steps.
Check the name of the IB HCA interface.
user@bm-dev-001:~$ ibdev2netdev
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
mlx5_3 port 1 ==> ibs19 (Down)
Code block. Check the name of IB HCA interface
Check the HCA interface that can communicate with IB Switch#1.
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
Code block. HCA Interface Check
Check the HCA interface that can communicate with IB Switch#2.
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_3 port 1 ==> ibs19 (Down)
Code block. HCA Interface Check
Use the SERVER Side command to check the communication status.
- The CLIENT Side command is entered second, for mutual communication.
user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x002e PSN 0xa86622
 remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40             0.317238
---------------------------------------------------------------------------------------
Code block. Communication Status Check
Use the CLIENT Side command to check the communication status.
- The SERVER Side command is entered first, for mutual communication.
root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0a QPN 0x002a PSN 0x98a48e
 remote address: LID 0x07 QPN 0x002c PSN 0xe68304
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49           19006.37             0.304102
---------------------------------------------------------------------------------------
Code block. Communication Status Check
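The figure of interest in the `ib_send_bw` result is the `BW average[MB/sec]` column. The sketch below pulls that value out of the result row and compares it with a minimum; it assumes a single-message-size run with the five-column result row shown in the sample output. The function names `ib_bw_average` and `ib_bw_at_least` are hypothetical helpers, and the threshold in the usage example is an illustrative assumption, not a value specified by this guide.

```shell
# Hypothetical helper: print the "BW average[MB/sec]" value (4th column)
# from the numeric result row of ib_send_bw output read on stdin.
ib_bw_average() {
    awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $4 }'
}

# Hypothetical helper: succeed (exit 0) if the average bandwidth found on
# stdin is at least $1 MB/sec; fail if it is lower or no result row exists.
ib_bw_at_least() {
    min="$1"
    avg="$(ib_bw_average)"
    [ -n "$avg" ] && awk -v a="$avg" -v m="$min" '
        BEGIN {
            if (a + 0 >= m + 0) exit 0   # numeric comparison
            exit 1
        }'
}

# Example usage on the client side (not executed here; 18000 is an
# illustrative threshold chosen for this sketch):
#   ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP> | ib_bw_at_least 18000
```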
Check IB Service Related Kernel Modules
Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.
user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem 16384 0
ib_core 315392 9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 35315712 156 nvidia_uvm,nv_peer_mem,nvidia_modeset
user@bm-dev-001:~$ service nv_peer_mem status
nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
Docs: man:systemd-sysv-generator(8)
Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp 32768 1 iscsi_tcp
libiscsi 57344 2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi 110592 4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib 131072 0
ib_cm 57344 2 rdma_cm,ib_ipoib
ib_umad 24576 8
mlx5_ib 380928 0
ib_uverbs 135168 18 rdma_ucm,mlx5_ib
ib_core 315392 9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c 16384 2 btrfs,raid456
mlx5_core 1458176 1 mlx5_ib
auxiliary 16384 2 mlx5_ib,mlx5_core
mlx_compat 65536 12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Storage Physical Disk Resources and Multi-Path Check
Check the IaaS HW level by checking the storage physical disk resources and Multi-Path.
root@bm-dev-002:/tmp# fdisk -l
root@bm-dev-002:/tmp# multipath -ll
Service Network check after new Multi-node GPU Cluster deployment
Use the following command to check if the MII Status of Bonding and Slave Interface is up.
Command
root@mngc-001:~# cat /proc/net/bonding/bond-srv
Ethernet Channel Bonding Driver: v5.15.0-25-generic
Code block. Service Network Check Command
Confirmation result
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0
Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
Code block. Service Network Check Command Result
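The MII Status check can be automated by scanning the bonding status text for any `MII Status` line that is not `up`. This is a minimal sketch that assumes the `/proc/net/bonding/<bond>` layout shown in the result above; `bond_mii_not_up` is a hypothetical helper name introduced for illustration.

```shell
# Hypothetical helper: read /proc/net/bonding/<bond> content on stdin and
# print "<interface> <status>" for every MII Status that is not "up".
# The first MII Status line belongs to the bond itself, so it is labelled
# "bond"; later ones are labelled with the preceding Slave Interface name.
bond_mii_not_up() {
    awk '
        BEGIN { who = "bond" }
        /^Slave Interface:/ { who = $3 }
        /^MII Status:/ && $3 != "up" { print who, $3 }
    '
}

# Example usage on a real system (not executed here):
#   bond_mii_not_up < /proc/net/bonding/bond-srv
```

No output means the bond and all of its slave interfaces report `up`.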
Time Server and time synchronization check after new Multi-node GPU Cluster deployment
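The OS image has the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check that one line is marked with ^* in the MS column. The ^ indicates a server source, and the * indicates the source to which chronyd is currently synchronized.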
command
root@mngc-001:~# chronyc sources -V
Code Block. chrony daemon installation command
confirmation result
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms
Code Block. chrony daemon installation check result
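To automate the ^* check, the output can be scanned line by line. This is a minimal sketch under the assumption that the chronyc sources output has been captured as text; synced_source is a hypothetical helper name, not a chrony API.

```python
def synced_source(chronyc_sources_output):
    """Return the address of the source marked "^*", or None if there is none."""
    for line in chronyc_sources_output.splitlines():
        fields = line.split()
        # The first field holds the mode and state characters, e.g. "^*" or "^+".
        if len(fields) >= 2 and fields[0] == "^*":
            return fields[1]
    return None


# Copy of the result shown above, used as sample input.
SAMPLE = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms
"""
```

A return value of None means no source is currently selected, which is the case the inspection step is meant to catch.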
GPU MIG/ECC Setting Initialization Check Guide
When you apply for the Multi-node GPU Cluster product, the GPU MIG/ECC settings are initialized. However, for the setting values to take effect, reboot the server once after initial deployment, then follow the inspection guide below to confirm that the setting values have been applied before use.
- MIG: Multi-Instance GPU
- ECC: Error Correction Code
MIG Setup Initialization
Refer to the following for how to check and initialize MIG settings.
Use the following command to check that the status value of MIG M. is Disabled.
command
root@bm-dev-001:~# nvidia-smi
Code Block. MIG M. Initialize Settings
confirmation result
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver version: 470.129.06   CUDA Version: 11.4                 |
|----------------------------------+-----------------------------+------------------------|
| GPU  Name          Persistence-M | Bus-Id               Disp.A | Volatile Uncorr. ECC   |
| Fan  Temp  Perf    Pwr:Usage/Cap | Memory-Usage                | GPU-Util  Compute M.   |
|                                  |                             | MIG M.                 |
|==================================+=============================+========================|
| 0  NVIDIA A100-SXM...        Off | 00000000:03:00.0        Off | Off                    |
| N/A  29C  P0       57W / 400W    | 0MiB / 81251MiB             | 0%  Default            |
|                                  |                             | Disabled               |
+----------------------------------+-----------------------------+------------------------+
| 0  NVIDIA A100-SXM...        Off | 00000000:0C:00.0        Off | Off                    |
| N/A  30C  P0       58W / 400W    | 0MiB / 81251MiB             | 18%  Default           |
|                                  |                             | Disabled               |
+-----------------------------------------------------------------------------------------+
Code Block. MIG M. Initialization Setting Check Result
If the status value of MIG M. is not Disabled, use the following commands to initialize MIG.
root@bm-dev-001:~# nvidia-smi -mig 0
root@bm-dev-001:~# nvidia-smi --gpu-reset
Code Block. MIG M. Status Value Initialization
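On nodes with many GPUs, reading the table by eye is error-prone. The sketch below is one possible way to script the same check. It assumes the machine-readable query nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader, which is not used in this guide and is introduced here as an alternative to the table view; gpus_with_mig_not_disabled and query_mig_modes are hypothetical helper names.

```python
import subprocess

# Assumed query: prints one "index, mode" line per GPU, e.g. "0, Disabled".
QUERY = ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"]


def gpus_with_mig_not_disabled(csv_text):
    """Return the GPU indices whose current MIG mode is anything other than Disabled."""
    flagged = []
    for line in csv_text.splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        index, mode = (part.strip() for part in line.split(",", 1))
        if mode != "Disabled":
            flagged.append(index)
    return flagged


def query_mig_modes():
    """Run the query on the local node and return its raw text output."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling gpus_with_mig_not_disabled(query_mig_modes()) on a node returns an empty list when every GPU is in the expected state; any index in the list is a GPU for which the initialization commands above apply.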
ECC Setting Initialization
Refer to the following for how to check and initialize the ECC settings.
Use the following command to check if the status value of Volatile Uncorr. ECC is Off.
command
root@bm-dev-001:~# nvidia-smi
Code Block. ECC Setting Command
confirmation result
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver version: 470.129.06   CUDA Version: 11.4                 |
|----------------------------------+-----------------------------+------------------------|
| GPU  Name          Persistence-M | Bus-Id               Disp.A | Volatile Uncorr. ECC   |
| Fan  Temp  Perf    Pwr:Usage/Cap | Memory-Usage                | GPU-Util  Compute M.   |
|                                  |                             | MIG M.                 |
|==================================+=============================+========================|
| 0  NVIDIA A100-SXM...        Off | 00000000:03:00.0        Off | Off                    |
| N/A  29C  P0       57W / 400W    | 0MiB / 81251MiB             | 0%  Default            |
|                                  |                             | Disabled               |
+----------------------------------+-----------------------------+------------------------+
| 0  NVIDIA A100-SXM...        Off | 00000000:0C:00.0        Off | Off                    |
| N/A  30C  P0       61W / 400W    | 0MiB / 81251MiB             | 18%  Default           |
|                                  |                             | Disabled               |
+-----------------------------------------------------------------------------------------+
Code Block. ECC Setting Check Result
If the status value of Volatile Uncorr. ECC is On*, reboot the server.
If the status value of Volatile Uncorr. ECC is neither On* nor Off, use the following command to initialize ECC. After initialization, reboot and check that the status value is Off.
root@bm-dev-001:~# nvidia-smi --ecc-config=0
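The three cases for the Volatile Uncorr. ECC status value can be summarized as a small decision function. This is only an illustrative sketch of the rules stated in this section; ecc_next_step and its return labels are hypothetical names, and the status string is assumed to have been read from the nvidia-smi table by the operator.

```python
def ecc_next_step(status):
    """Map a Volatile Uncorr. ECC status value to the action described in this guide."""
    status = status.strip()
    if status == "Off":
        return "ok"          # expected state, nothing to do
    if status == "On*":
        return "reboot"      # reboot the server, then check again
    # Any other value: run "nvidia-smi --ecc-config=0", reboot, and confirm Off.
    return "initialize"
```

For example, ecc_next_step("On") returns "initialize", which corresponds to running nvidia-smi --ecc-config=0 followed by a reboot.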