Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster Service Scope

If an IaaS HW-level issue occurs in the Multi-node GPU Cluster service, you can receive technical support through Contact Us in the Support Center. However, risks arising from changes such as OS kernel updates or application installations are the user's responsibility and are not covered by technical support, so please take care when performing tasks such as system updates.

IaaS HW-level Issues
- HW fault event messages raised within the server on the IPMI (iLO) HW monitoring console
- GPU HW malfunction identified by the nvidia-smi command
- HW error messages found when inspecting InfiniBand HCA cards or InfiniBand switches

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, we recommend inspecting the IaaS HW level by following this inspection guide.
OS Kernel and Package Holding
- If you do not want package versions to be updated automatically, we recommend blocking package updates with the apt-mark command.
- We recommend blocking updates to the Linux kernel and IB-related package versions.

To hold the OS kernel and packages, follow these steps.
- Use the following commands to check the kernel and IB-related package versions.

```
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
ii  crash                                  7.2.8-1ubuntu1.20.04.1          amd64  kernel debugging utility, allowing gdb like syntax
ii  dkms                                   2.8.1-5ubuntu2                  all    Dynamic Kernel Module Support Framework
ii  dmeventd                               2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper event daemon
ii  dmsetup                                2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper userspace library
ii  iser-dkms                              5.4-OFED.5.4.3.0.1.1            all    DKMS support fo iser kernel modules
ii  isert-dkms                             5.4-OFED.5.4.3.0.1.1            all    DKMS support fo isert kernel modules
ii  kernel-mft-dkms                        4.17.2-12                       all    DKMS support for kernel-mft kernel modules
ii  kmod                                   27-1ubuntu2                     amd64  tools for managing Linux kernel modules
ii  knem                                   1.1.4.90mlnx1-OFED.5.1.2.5.0.1  amd64  userspace tools for the KNEM kernel module
ii  knem-dkms                              1.1.4.90mlnx1-OFED.5.1.2.5.0.1  all    DKMS support for mlnx-ofed kernel modules
ii  libaio1:amd64                          0.3.112-5                       amd64  Linux kernel AIO access library - shared library
ii  libdevmapper-event1.02.1:amd64         2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper event support library
ii  libdevmapper1.02.1:amd64               2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper userspace library
ii  libdrm-amdgpu1:amd64                   2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-common                          2.4.107-8ubuntu1~20.04.2        all    Userspace interface to kernel DRM services -- common files
ii  libdrm-intel1:amd64                    2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to intel-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:amd64                  2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:amd64                   2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:amd64                          2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to kernel DRM services -- runtime
ii  linux-firmware                         1.187.29                        all    Firmware for Linux kernel drivers
hi  linux-generic                          5.4.0.105.109                   amd64  Complete Generic Linux kernel and headers
ii  linux-headers-5.4.0-104                5.4.0-104.118                   all    Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-104-generic        5.4.0-104.118                   amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii  linux-headers-5.4.0-105                5.4.0-105.119                   all    Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-105-generic        5.4.0-105.119                   amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
hi  linux-headers-generic                  5.4.0.105.109                   amd64  Generic Linux kernel headers
ii  linux-image-5.4.0-104-generic          5.4.0-104.118                   amd64  Signed kernel image generic
ii  linux-image-5.4.0-105-generic          5.4.0-105.119                   amd64  Signed kernel image generic
hi  linux-image-generic                    5.4.0.105.109                   amd64  Generic Linux kernel image
ii  linux-libc-dev:amd64                   5.4.0-105.119                   amd64  Linux Kernel Headers for development
ii  linux-modules-5.4.0-104-generic        5.4.0-104.118                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-5.4.0-105-generic        5.4.0-105.119                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-104-generic  5.4.0-104.118                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-105-generic  5.4.0-105.119                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  mlnx-ofed-kernel-dkms                  5.4-OFED.5.4.3.0.3.1            all    DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                 5.4-OFED.5.4.3.0.3.1            amd64  Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mlnx-tools                             5.2.0-0.54303                   amd64  Userspace tools to restart and tune MLNX_OFED kernel modules
ii  nvidia-kernel-common-470               470.103.01-0ubuntu0.20.04.1     amd64  Shared files used with the kernel module
ii  nvidia-kernel-source-470               470.103.01-0ubuntu0.20.04.1     amd64  NVIDIA kernel source package
ii  nvidia-peer-memory                     1.2-0                           all    nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms                1.2-0                           all    DKMS support for nvidia-peer-memory kernel modules
ii  rsyslog                                8.2001.0-1ubuntu1.1             amd64  reliable system and kernel logging daemon
ii  srp-dkms                               5.4-OFED.5.4.3.0.1.1            all    DKMS support fo srp kernel modules
```

- Use the apt-mark command to hold package updates.

```
# apt-mark hold <package-name>
```
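The hold step above can be scripted. The sketch below is a minimal, hypothetical helper: it parses `dpkg -l` output and prints the corresponding `apt-mark hold` commands. The pattern list is an assumption based on the packages shown above; adapt it to your environment.

```python
import re

# Illustrative patterns covering the kernel/IB/NVIDIA packages this guide suggests
# holding; the exact list is an assumption, adjust it for your environment.
HOLD_PATTERNS = [r"^linux-(image|headers|modules|generic)", r"mlnx", r"^nvidia-", r"-dkms$"]

def packages_to_hold(dpkg_output: str) -> list:
    """Return installed package names (dpkg status ii/hi) matching HOLD_PATTERNS."""
    names = []
    for line in dpkg_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in ("ii", "hi"):
            name = parts[1].split(":")[0]  # strip the ":amd64" architecture suffix
            if any(re.search(p, name) for p in HOLD_PATTERNS):
                names.append(name)
    return names

sample = """\
ii  linux-image-5.4.0-105-generic  5.4.0-105.119         amd64 Signed kernel image generic
ii  mlnx-ofed-kernel-dkms          5.4-OFED.5.4.3.0.3.1  all   DKMS support for mlnx-ofed kernel modules
ii  rsyslog                        8.2001.0-1ubuntu1.1   amd64 reliable system and kernel logging daemon
"""

for pkg in packages_to_hold(sample):
    print(f"apt-mark hold {pkg}")
```

Feed the script real `dpkg -l` output and review the printed commands before running them; `apt-mark showhold` then confirms what is held.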
Intel E810 Driver Update

Check the Intel E810 driver version and update it to the recommended version.
- Recommended Intel E810 driver version from the server manufacturer: 1.15.4
- Download the driver (ice-1.15.4.tar.gz)

The driver update procedure is as follows.
- Move the base driver tar file to a directory of your choice.
  Example: /home/username/ice or /usr/local/src/ice
- Untar/unzip the archive file. (x.x.x is the version number of the driver tar file.)

```
tar zxf ice-x.x.x.tar.gz
```

- Change to the driver src directory. (x.x.x is the version number of the driver tar file.)

```
cd ice-x.x.x/src/
```

- Compile the driver module.

```
make install
```

- After the update finishes, check the version.

```
lsmod | grep ice
modinfo ice | grep version
```
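The final version check can be automated with a short sketch like the one below; it parses the `version:` field from `modinfo ice` output and compares it against the 1.15.4 recommendation above. The dotted-version comparison is numeric only (no suffix handling), which is an assumption that holds for the ice driver's versioning.

```python
RECOMMENDED_ICE_VERSION = "1.15.4"  # recommended version from the section above

def parse_ice_version(modinfo_output: str) -> str:
    """Extract the version field from `modinfo ice` output."""
    for line in modinfo_output.splitlines():
        if line.strip().startswith("version:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no 'version:' line in modinfo output")

def is_at_least(version: str, minimum: str) -> bool:
    """Compare purely numeric dotted versions such as '1.15.4'."""
    return tuple(int(x) for x in version.split(".")) >= tuple(int(x) for x in minimum.split("."))

sample = "version:        1.15.4\nlicense:        GPL v2"
print(is_at_least(parse_ice_version(sample), RECOMMENDED_ICE_VERSION))  # prints True
```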
NVIDIA Driver Check

To inspect the IaaS HW level by checking the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status), follow these steps.

- Check the GPU driver and HW status.

```
user@bm-dev-001:~$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity  NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     48-63         3
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     48-63         3
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     16-31         1
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     16-31         1
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS     SYS     SYS     PXB     112-127       7
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS     SYS     SYS     PXB     112-127       7
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS     SYS     PXB     SYS     80-95         5
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS     SYS     PXB     SYS     80-95         5
mlx5_0  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   X       SYS     SYS     SYS
mlx5_1  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS     X       SYS     SYS
mlx5_2  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS     SYS     X       SYS
mlx5_3  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS     SYS     SYS     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

- Check the NVSwitch HW status.

```
user@bm-dev-001:~$ nvidia-smi nvlink --status
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
     Link 0: 25 GB/s
     Link 1: 25 GB/s
     Link 2: 25 GB/s
     Link 3: 25 GB/s
     Link 4: 25 GB/s
     Link 5: 25 GB/s
     Link 6: 25 GB/s
     Link 7: 25 GB/s
     Link 8: 25 GB/s
     Link 9: 25 GB/s
     Link 10: 25 GB/s
     Link 11: 25 GB/s
```

- Check the InfiniBand (IB) HCA card HW status and link.

```
user@bm-dev-001:~$ ibdev2netdev -v
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$

root@bm-dev-001:~# ibstat
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff5060ac
    System image GUID: 0x88e9a4ffff5060ac
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 8
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff5060ac
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504080
    System image GUID: 0x88e9a4ffff504080
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 5
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504080
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff505038
    System image GUID: 0x88e9a4ffff505038
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 2
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff505038
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504094
    System image GUID: 0x88e9a4ffff504094
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 7
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504094
        Link layer: InfiniBand
```
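When there are many HCAs, eyeballing `ibstat` output is error-prone. The sketch below is a hypothetical helper that parses `ibstat` text and flags any CA whose port is not Active/LinkUp at the expected rate; the 200 Gb/s rate is taken from the sample output above and is an assumption for other fabrics.

```python
EXPECTED_RATE = "200"  # HDR links, as in the sample ibstat output above; adjust if different

def _port_ok(fields: dict) -> bool:
    return (fields.get("State") == "Active"
            and fields.get("Physical state") == "LinkUp"
            and fields.get("Rate") == EXPECTED_RATE)

def check_ibstat(ibstat_output: str) -> dict:
    """Map each CA name to True/False depending on whether its port looks healthy."""
    results, ca, fields = {}, None, {}
    for raw in ibstat_output.splitlines():
        line = raw.strip()
        if line.startswith("CA '"):
            if ca is not None:
                results[ca] = _port_ok(fields)
            ca, fields = line.split("'")[1], {}
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    if ca is not None:
        results[ca] = _port_ok(fields)
    return results

sample = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 40
"""
print(check_ibstat(sample))  # {'mlx5_0': True, 'mlx5_1': False}
```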
Check IB Bandwidth Communication

To inspect the IaaS HW level by checking IB bandwidth communication (ib_send_bw), follow these steps.

- Check the names of the IB HCA interfaces.

```
user@bm-dev-001:~$ ibdev2netdev
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
mlx5_3 port 1 ==> ibs19 (Down)
```

- Identify the HCA interfaces that can communicate with IB switch #1.

```
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
```

- Identify the HCA interfaces that can communicate with IB switch #2.

```
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_3 port 1 ==> ibs19 (Down)
```

- Run the SERVER-side command first to check mutual communication.

```
user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x002e PSN 0xa86622
 remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 65536      1000             0.00               19827.40             0.317238
---------------------------------------------------------------------------------------
```

- Run the CLIENT-side command second to check mutual communication.

```
root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0a QPN 0x002a PSN 0x98a48e
 remote address: LID 0x07 QPN 0x002c PSN 0xe68304
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 65536      1000           19008.49            19006.37             0.304102
---------------------------------------------------------------------------------------
```
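If you run this test across many node pairs, a small parser can gate on the measured bandwidth. The sketch below extracts the BW average column from ib_send_bw output; the 15000 MB/s floor is purely illustrative (not from this guide), so tune it for your fabric.

```python
def parse_bw_average(ib_send_bw_output: str) -> float:
    """Extract BW average[MB/sec] from the result row that follows the '#bytes' header."""
    lines = ib_send_bw_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("#bytes"):
            # columns: #bytes, #iterations, BW peak, BW average, MsgRate
            return float(lines[i + 1].split()[3])
    raise ValueError("result row not found")

sample = """\
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
 65536      1000             0.00               19827.40             0.317238
"""
MIN_BW_MB_S = 15000.0  # illustrative floor; tune for your link speed and message size
print(parse_bw_average(sample) >= MIN_BW_MB_S)  # prints True
```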
Check IB Service-related Kernel Modules

Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.

```
user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset

user@bm-dev-001:~$ service nv_peer_mem status
nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
   Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
   Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)

user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi  110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
```
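The module check above can be reduced to a set comparison. The sketch below parses lsmod output and reports which required modules are missing; the module set is an assumption drawn from the sample output, so extend it for your workload.

```python
# Module names this guide checks with lsmod; treat the set as an assumption and
# extend it for your workload.
REQUIRED_MODULES = {"nv_peer_mem", "ib_core", "mlx5_ib", "mlx5_core", "ib_ipoib"}

def loaded_modules(lsmod_output: str) -> set:
    """lsmod prints a 'Module Size Used by' header, then one module per line."""
    return {line.split()[0] for line in lsmod_output.splitlines()[1:] if line.strip()}

def missing_modules(lsmod_output: str) -> set:
    return REQUIRED_MODULES - loaded_modules(lsmod_output)

sample = """\
Module                  Size  Used by
nv_peer_mem            16384  0
mlx5_ib               380928  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem
mlx5_core            1458176  1 mlx5_ib
"""
print(missing_modules(sample))  # {'ib_ipoib'}
```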
Check Storage Physical Disk Resources and Multipath
Check the storage physical disk resources and multipath to inspect the IaaS HW level.

```
root@bm-dev-002:/tmp# fdisk -l
root@bm-dev-002:/tmp# multipath -ll
```

Check the Service Network after a New Multi-node GPU Cluster Deployment
Use the following command to confirm that the MII Status of the bonding and slave interfaces is up.

Command

```
root@mngc-001:~# cat /proc/net/bonding/bond-srv
Ethernet Channel Bonding Driver: v5.15.0-25-generic
```

Result

```
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
```
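The bonding check lends itself to automation. The sketch below parses the `/proc/net/bonding/<bond>` text format shown above into the active slave and a per-slave MII status map; it is a minimal helper, not part of the platform tooling.

```python
def bonding_status(bond_text: str):
    """Parse /proc/net/bonding/<bond> text into (active_slave, {slave: mii_status})."""
    active, slaves, current = None, {}, None
    for raw in bond_text.splitlines():
        line = raw.strip()
        if line.startswith("Currently Active Slave:"):
            active = line.split(":", 1)[1].strip()
        elif line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            slaves[current] = line.split(":", 1)[1].strip()
            current = None  # only the first MII Status after each Slave Interface belongs to it
    return active, slaves

sample = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: ens9f0
MII Status: up
Slave Interface: ens9f0
MII Status: up
Slave Interface: ens11f0
MII Status: up
"""
active, slaves = bonding_status(sample)
print(active, all(v == "up" for v in slaves.values()))  # ens9f0 True
```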
Check Time Server Synchronization after a New Multi-node GPU Cluster Deployment

The OS image ships with the chrony daemon installed and synchronization to the SCP NTP servers configured. Use the following command and confirm that a line in the MS Name column is marked with ^*.

Command

```
root@mngc-001:~# chronyc sources -V
```

Result

```
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4   10   377  1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4   10   377   312   -367us[ -388us] +/-   13ms
```
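For fleet-wide checks, the `^*` marker can be detected programmatically. The sketch below scans `chronyc sources` output for the source chronyd is currently synchronized to; it is an illustrative helper only.

```python
def synced_source(chronyc_output: str):
    """Return the address of the source chronyd is synchronized to (the '^*' line), or None."""
    for line in chronyc_output.splitlines():
        if line.startswith("^*"):
            return line.split()[1]
    return None

sample = """\
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4   10   377  1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4   10   377   312   -367us[ -388us] +/-   13ms
"""
print(synced_source(sample))  # 198.19.1.54
```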
GPU MIG/ECC Reset Inspection Guide

When you apply for the Multi-node GPU Cluster product, the GPU MIG/ECC settings are reset. However, to ensure the settings are applied correctly, reboot once after delivery and then verify the applied values yourself by following this inspection guide.
- MIG: Multi-Instance GPU
- ECC: Error Correction Code
MIG 설정 초기화
MIG 설정값을 확인하고 초기화하는 방법은 다음을 참조하세요.
다음 명령어를 사용하여 MIG M.의 상태값이 Disabled인지 확인하세요.
명령어
배경색 변경root@bm-dev-001:~#nvidia-smiroot@bm-dev-001:~#nvidia-smi코드블록. MIG M. 설정 초기화 확인 결과
배경색 변경+-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 470.129.06 Driver version: 470.129.06 CUDA Version: 11.4 | |----------------------------------+-----------------------------+------------------------| | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |==================================+=============================+========================| | 0 NVIDIA A100-SXM... Off | 00000000:03:00.0 Off | Off | | N/A 29C P0 57W / 400W | 0MiB / 81251MiB | 0% Default | | | | Disabled | +----------------------------------+-----------------------------+------------------------+ | 0 NVIDIA A100-SXM... Off | 00000000:0C:00.0 Off | Off | | N/A 30C P0 58W / 400W | 0MiB / 81251MiB | 18% Default | | | | Disabled | +-----------------------------------------------------------------------------------------++-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 470.129.06 Driver version: 470.129.06 CUDA Version: 11.4 | |----------------------------------+-----------------------------+------------------------| | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |==================================+=============================+========================| | 0 NVIDIA A100-SXM... Off | 00000000:03:00.0 Off | Off | | N/A 29C P0 57W / 400W | 0MiB / 81251MiB | 0% Default | | | | Disabled | +----------------------------------+-----------------------------+------------------------+ | 0 NVIDIA A100-SXM... Off | 00000000:0C:00.0 Off | Off | | N/A 30C P0 58W / 400W | 0MiB / 81251MiB | 18% Default | | | | Disabled | +-----------------------------------------------------------------------------------------+코드블록. MIG M. 설정 초기화 확인 결과 MIG M.의 상태값이 Disabled이 아닐 경우, 다음 명령어를 사용하여 MIG를 초기화하세요.
root@bm-dev-001:~# nvidia-smi -mig 0
root@bm-dev-001:~# nvidia-smi --gpu-reset
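On a cluster with many nodes, the per-GPU MIG check above can be scripted instead of read off the table. This is a minimal sketch, assuming a driver version where `mig.mode.current` is exposed through `nvidia-smi --query-gpu`; the `mig_enabled_gpus` helper is illustrative, not part of nvidia-smi:

```shell
#!/bin/bash
# Sketch only: disable MIG on GPUs where it is still enabled.
# Assumes `nvidia-smi --query-gpu=mig.mode.current` is supported by this driver.
set -euo pipefail

# Print the indexes of GPUs whose MIG mode is Enabled, given
# "index, mig.mode.current" CSV lines on stdin (illustrative helper).
mig_enabled_gpus() {
    while IFS=',' read -r idx mode; do
        mode="${mode# }"                 # trim the space after the comma
        if [ "$mode" = "Enabled" ]; then
            echo "$idx"
        fi
    done
}

if command -v nvidia-smi >/dev/null; then
    for idx in $(nvidia-smi --query-gpu=index,mig.mode.current \
                            --format=csv,noheader | mig_enabled_gpus); do
        nvidia-smi -i "$idx" -mig 0      # requires root; follow with --gpu-reset
    done
fi
```

Targeting individual GPUs with `-i` avoids resetting devices that are already in the expected state.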
Resetting the ECC setting
To check and reset the ECC setting, refer to the following.
Use the following command to check whether the Volatile Uncorr. ECC status is Off.
Command
root@bm-dev-001:~# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4               |
|----------------------------------+-----------------------------+------------------------|
| GPU  Name        Persistence-M   | Bus-Id         Disp.A       | Volatile Uncorr. ECC   |
| Fan  Temp  Perf  Pwr:Usage/Cap   | Memory-Usage                | GPU-Util  Compute M.   |
|                                  |                             |               MIG M.   |
|==================================+=============================+========================|
|   0  NVIDIA A100-SXM...   Off    | 00000000:03:00.0       Off  |                  Off   |
| N/A  29C   P0   57W / 400W       |     0MiB / 81251MiB         |      0%      Default   |
|                                  |                             |             Disabled   |
+----------------------------------+-----------------------------+------------------------+
|   0  NVIDIA A100-SXM...   Off    | 00000000:0C:00.0       Off  |                  Off   |
| N/A  30C   P0   61W / 400W       |     0MiB / 81251MiB         |     18%      Default   |
|                                  |                             |             Disabled   |
+-----------------------------------------------------------------------------------------+

If the Volatile Uncorr. ECC status is On* (the asterisk means a setting change is pending a reboot), reboot the server.
If the status is neither On* nor Off, use the following command to reset ECC. After the reset, reboot the server and confirm that the status is Off.
root@bm-dev-001:~# nvidia-smi --ecc-config=0
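The three ECC cases above (Off / On* / plain On) can likewise be checked per GPU in a script. This is a sketch, assuming `ecc.mode.current` and `ecc.mode.pending` are queryable on this driver; `ecc_action` is an illustrative helper, not an nvidia-smi feature:

```shell
#!/bin/bash
# Sketch only: report which action each GPU needs based on its ECC state.
# Assumes `nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending` is supported.
set -euo pipefail

# Map a (current, pending) ECC mode pair to the action this guide prescribes.
ecc_action() {   # $1=current $2=pending
    if [ "$1" = "Enabled" ] && [ "$2" = "Disabled" ]; then
        echo "reboot"        # shown as "On*" in nvidia-smi: change already pending
    elif [ "$1" = "Enabled" ]; then
        echo "ecc-config"    # plain On: run nvidia-smi --ecc-config=0, then reboot
    else
        echo "ok"            # already Off
    fi
}

if command -v nvidia-smi >/dev/null; then
    nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending \
               --format=csv,noheader \
    | while IFS=',' read -r idx cur pend; do
        echo "GPU ${idx}: $(ecc_action "${cur# }" "${pend# }")"
    done
fi
```

Querying both the current and pending modes distinguishes a GPU that only needs a reboot from one that still needs `--ecc-config=0`.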