Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster Service Scope

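If an IaaS HW-level problem occurs in the Multi-node GPU Cluster service, you can get technical support by submitting an inquiry through the Support Center. However, risks arising from changes such as OS kernel updates or application installation are the user's responsibility and are difficult to support, so take care when performing system updates and similar work.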

IaaS HW-Level Problems

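  • HW fault event messages for the server reported in the IPMI (iLO) HW monitoring console
  • GPU HW operation errors reported by the nvidia-smi command
  • HW error messages found when inspecting the InfiniBand HCA card or the InfiniBand switch
Caution
Multi-node GPU Cluster is sensitive to software version compatibility across Ubuntu OS, NVIDIA, and InfiniBand, so official technical support is not available after the user changes the system, for example by updating the OS kernel or installing applications.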

IaaS HW Inspection Guide

After applying for the Multi-node GPU Cluster service, we recommend inspecting the IaaS HW level by following this inspection guide.

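OS Kernel and Package Holding

Note
  • If you do not want package versions to be updated automatically, we recommend blocking package updates with the apt-mark command.
  • We recommend blocking version updates for the Linux kernel and IB-related packages.

To hold the OS kernel and packages, follow these steps.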

  1. Check the versions of the kernel and IB-related packages with the following commands.
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
    root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
    ii  crash                                 7.2.8-1ubuntu1.20.04.1                  amd64        kernel debugging utility, allowing gdb like syntax
    ii  dkms                                  2.8.1-5ubuntu2                          all          Dynamic Kernel Module Support Framework
    ii  dmeventd                              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event daemon
    ii  dmsetup                               2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  iser-dkms                             5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo iser kernel modules
    ii  isert-dkms                            5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo isert kernel modules
    ii  kernel-mft-dkms                       4.17.2-12                               all          DKMS support for kernel-mft kernel modules
    ii  kmod                                  27-1ubuntu2                             amd64        tools for managing Linux kernel modules
    ii  knem                                  1.1.4.90mlnx1-OFED.5.1.2.5.0.1          amd64        userspace tools for the KNEM kernel module
    ii  knem-dkms                             1.1.4.90mlnx1-OFED.5.1.2.5.0.1          all          DKMS support for mlnx-ofed kernel modules
    ii  libaio1:amd64                         0.3.112-5                               amd64        Linux kernel AIO access library - shared library
    ii  libdevmapper-event1.02.1:amd64        2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper event support library
    ii  libdevmapper1.02.1:amd64              2:1.02.167-1ubuntu1                     amd64        Linux Kernel Device Mapper userspace library
    ii  libdrm-amdgpu1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime
    ii  libdrm-common                         2.4.107-8ubuntu1~20.04.2                all          Userspace interface to kernel DRM services -- common files
    ii  libdrm-intel1:amd64                   2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to intel-specific kernel DRM services -- runtime
    ii  libdrm-nouveau2:amd64                 2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to nouveau-specific kernel DRM services -- runtime
    ii  libdrm-radeon1:amd64                  2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to radeon-specific kernel DRM services -- runtime
    ii  libdrm2:amd64                         2.4.107-8ubuntu1~20.04.2                amd64        Userspace interface to kernel DRM services -- runtime
    ii  linux-firmware                        1.187.29                                all          Firmware for Linux kernel drivers
    hi  linux-generic                         5.4.0.105.109                           amd64        Complete Generic Linux kernel and headers
    ii  linux-headers-5.4.0-104               5.4.0-104.118                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    ii  linux-headers-5.4.0-105               5.4.0-105.119                           all          Header files related to Linux kernel version 5.4.0
    ii  linux-headers-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
    hi  linux-headers-generic                 5.4.0.105.109                           amd64        Generic Linux kernel headers
    ii  linux-image-5.4.0-104-generic         5.4.0-104.118                           amd64        Signed kernel image generic
    ii  linux-image-5.4.0-105-generic         5.4.0-105.119                           amd64        Signed kernel image generic
    hi  linux-image-generic                   5.4.0.105.109                           amd64        Generic Linux kernel image
    ii  linux-libc-dev:amd64                  5.4.0-105.119                           amd64        Linux Kernel Headers for development
    ii  linux-modules-5.4.0-104-generic       5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-5.4.0-105-generic       5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-104-generic 5.4.0-104.118                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  linux-modules-extra-5.4.0-105-generic 5.4.0-105.119                           amd64        Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
    ii  mlnx-ofed-kernel-dkms                 5.4-OFED.5.4.3.0.3.1                    all          DKMS support for mlnx-ofed kernel modules
    ii  mlnx-ofed-kernel-utils                5.4-OFED.5.4.3.0.3.1                    amd64        Userspace tools to restart and tune mlnx-ofed kernel modules
    ii  mlnx-tools                            5.2.0-0.54303                           amd64        Userspace tools to restart and tune MLNX_OFED kernel modules
    ii  nvidia-kernel-common-470              470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
    ii  nvidia-kernel-source-470              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
    ii  nvidia-peer-memory                    1.2-0                                   all          nvidia peer memory kernel module.
    ii  nvidia-peer-memory-dkms               1.2-0                                   all          DKMS support for nvidia-peer-memory kernel modules
    ii  rsyslog                               8.2001.0-1ubuntu1.1                     amd64        reliable system and kernel logging daemon
    ii  srp-dkms                              5.4-OFED.5.4.3.0.1.1                    all          DKMS support fo srp kernel modules
    Code block. Checking kernel and IB-related package versions
  2. Hold package updates with the apt-mark command.
    # apt-mark hold <package-name>
    Code block. Holding package updates
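
The check-then-hold sequence above can be sketched as a small filter. This is a minimal sketch, not an official tool: the function name hold_candidates is hypothetical, and the package-name patterns (linux-image, linux-headers, linux-modules, linux-generic, mlnx-, nvidia-) are assumptions taken from the sample listing, so adjust them to the packages you actually want to pin.

```shell
#!/bin/sh
# Print the names of packages worth pinning.
# Reads `dpkg -l` output on stdin; the name patterns are assumptions
# based on the sample listing, not an official list.
hold_candidates() {
  awk '
    # Keep installed ("ii") and already-held ("hi") rows; column 2 is the name.
    $1 ~ /^[ih]i$/ &&
    $2 ~ /^(linux-(image|headers|modules|generic)|mlnx-|nvidia-)/ {
      sub(/:.*$/, "", $2)   # drop an architecture suffix such as ":amd64"
      print $2
    }'
}
```

A possible way to use it, run as root: review the output of `dpkg -l | hold_candidates` first, then pipe the same list into `xargs -r apt-mark hold`, and confirm the result with `apt-mark showhold`.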

Intel E810 Driver Update

Check the Intel E810 driver version and update it to the recommended version.

Note

Update the driver as follows.

  1. Move the base driver tar file to the directory of your choice.
     Example: /home/username/ice or /usr/local/src/ice
  2. Untar/unzip the archive.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      Code block. Extracting the archive
  3. Change to the driver src directory.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Code block. Changing the directory
  4. Compile and install the driver module.

    make install
    Code block. Compiling the driver module

  5. After the update finishes, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Code block. Checking the version
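
The final version check can also be scripted. The sketch below is illustrative only: the function names are hypothetical, and the target version is an argument you supply yourself because this guide does not state the recommended version. It assumes GNU `sort -V` is available, as it is on Ubuntu.

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Uses GNU version sort: if the smaller of the two is $2, then $1 >= $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed ice module with a target version given as $1.
# `modinfo -F version ice` prints only the version field of the module.
check_ice_version() {
  required="$1"
  installed="$(modinfo -F version ice)" || return 2
  if version_at_least "$installed" "$required"; then
    echo "ice $installed: OK (>= $required)"
  else
    echo "ice $installed: older than $required"
    return 1
  fi
}
```

For example, `check_ice_version x.x.x` with x.x.x replaced by the recommended version returns a non-zero status when the installed module is older, which makes it easy to call from a provisioning script.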

NVIDIA Driver Check

Reference
Check nvidia-smi topo and the IB nv_peer_mem status.

To inspect the IaaS HW level by checking the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status), follow these steps.

  1. Check the GPU driver and HW status.
    user@bm-dev-001:~$ nvidia-smi topo -m
            GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity    NUMA Affinity
    GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     48-63   3
    GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     16-31   1
    GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     PXB     112-127 7
    GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     80-95   5
    GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     80-95   5
    mlx5_0  SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
    mlx5_1  PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
    mlx5_2  SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS
    mlx5_3  SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X
    
    Legend:
    
      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing at most a single PCIe bridge
      NV#  = Connection traversing a bonded set of # NVLinks
    Code block. Checking the GPU driver and HW status

  2. Check the NVSwitch HW status.
    user@bm-dev-001:~$ nvidia-smi nvlink --status
    GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
             Link 0: 25 GB/s
             Link 1: 25 GB/s
             Link 2: 25 GB/s
             Link 3: 25 GB/s
             Link 4: 25 GB/s
             Link 5: 25 GB/s
             Link 6: 25 GB/s
             Link 7: 25 GB/s
             Link 8: 25 GB/s
             Link 9: 25 GB/s
             Link 10: 25 GB/s
             Link 11: 25 GB/s
    Code block. Checking the NVSwitch HW status

  3. Check the InfiniBand (IB) HCA card HW status and links.
    user@bm-dev-001:~$ ibdev2netdev -v
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    
    
    root@bm-dev-001:~# ibstat
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Code block. Checking InfiniBand (IB) HCA card HW status and link
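
The ibstat check above comes down to confirming that every port reports State: Active and Physical state: LinkUp. As a minimal sketch, the shell function below summarizes that per HCA. The name check_ib_ports is hypothetical (it is not part of any InfiniBand tool), and the parser assumes the ibstat layout shown above.

```shell
# Hypothetical helper: summarize `ibstat` output read from stdin.
# Prints one line per port and returns non-zero if any port is not
# Active/LinkUp. Assumes the ibstat text layout shown in this guide.
check_ib_ports() {
  awk -v q="'" '
    # "CA <name>" header line (two fields); strip the quotes around the name
    $1 == "CA" && NF == 2 { ca = $2; gsub(q, "", ca) }
    $1 == "State:"        { state = $2 }
    $1 == "Physical" && $2 == "state:" {
      phys = $3
      ok = (state == "Active" && phys == "LinkUp")
      print ca, state, phys, (ok ? "OK" : "CHECK")
      if (!ok) bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

On a node this could be run as `ibstat | check_ib_ports`; a non-zero exit status means at least one port needs a closer look.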

Checking IB bandwidth communication

To check at the IaaS HW level by testing IB bandwidth communication (ib_send_bw), follow the steps below.

  1. Check the names of the IB HCA interfaces.

    user@bm-dev-001:~$ ibdev2netdev
    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code block. Checking IB HCA interface names

  2. Identify the HCA interfaces that can communicate with IB switch #1.

    mlx5_0 port 1 ==> ibs18 (Down)
    mlx5_2 port 1 ==> ibs20 (Down)
    Code block. Checking HCA interfaces

  3. Identify the HCA interfaces that can communicate with IB switch #2.

    mlx5_1 port 1 ==> ibs17 (Down)
    mlx5_3 port 1 ==> ibs19 (Down)
    Code block. Checking HCA interfaces

  4. Run the server-side command to check the mutual communication status.

    • Enter the client-side command second to start the test.
      user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
      ************************************
      * Waiting for client to connect... *
      ************************************
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       RX depth        : 512
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x07 QPN 0x002e PSN 0xa86622
       remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             0.00               19827.40                   0.317238
      ---------------------------------------------------------------------------------------
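      Code block. Checking communication status
  5. Run the client-side command to check the mutual communication status.

    • Enter the server-side command first so that it is already waiting for the client.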
      root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
      ---------------------------------------------------------------------------------------
                          Send BW Test
       Dual-port       : OFF          Device         : mlx5_3
       Number of qps   : 1            Transport type : IB
       Connection type : RC           Using SRQ      : OFF
       PCIe relax order: ON
       ibv_wr* API     : ON
       TX depth        : 128
       CQ Moderation   : 1
       Mtu             : 4096[B]
       Link type       : IB
       Max inline data : 0[B]
       rdma_cm QPs     : OFF
       Data ex. method : Ethernet
      ---------------------------------------------------------------------------------------
       local address: LID 0x0a QPN 0x002a PSN 0x98a48e
       remote address: LID 0x07 QPN 0x002c PSN 0xe68304
      ---------------------------------------------------------------------------------------
       #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
       65536      1000             19008.49            19006.37                  0.304102
      ---------------------------------------------------------------------------------------
      Code block. Checking communication status
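
The value to read in the ib_send_bw tables above is the BW average[MB/sec] column of the result row. The following sketch extracts it and compares it with a minimum that you supply. The function name bw_average_ok and any threshold you pass are assumptions; this guide does not state a pass/fail value.

```shell
# Hypothetical helper: read ib_send_bw output from stdin and compare the
# "BW average[MB/sec]" value with a caller-supplied minimum (first argument).
# Exit status: 0 = at or above the minimum, 1 = below, 2 = no result row found.
bw_average_ok() {
  awk -v min="$1" '
    # The result row is the 5-column line right after the "#bytes" header
    $1 == "#bytes"        { in_results = 1; next }
    in_results && NF == 5 { avg = $4; found = 1; in_results = 0 }
    END {
      if (!found) { print "no result row found"; exit 2 }
      printf "BW average: %s MB/sec (minimum %s)\n", avg, min
      exit ((avg + 0 >= min + 0) ? 0 : 1)
    }
  '
}
```

For example, the client-side output could be piped through `bw_average_ok 15000`, where 15000 is only a placeholder to be replaced with the value expected in your environment.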

Checking IB service-related kernel modules

Check the IB service-related kernel modules (lsmod) to verify the IaaS HW level.

user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset
Code block. Checking IB service-related kernel modules (1)
user@bm-dev-001:~$ service nv_peer_mem status
  nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
     Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
     Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
       Docs: man:systemd-sysv-generator(8)
    Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Code block. Checking IB service-related kernel modules (2)
user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi   110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
Code block. Checking IB service-related kernel modules (3)
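
Instead of reading the lsmod listings by eye, the first column can be compared against a list of module names. This is a minimal sketch: the function name missing_modules is hypothetical, and which modules to require is an assumption taken from the sample output above, not an official list.

```shell
# Hypothetical helper: read `lsmod` output from stdin and report which of
# the module names given as arguments are not loaded.
# Only the first column (the module name) is compared, so a name that
# appears only in the "Used by" column does not count as loaded.
missing_modules() {
  local loaded rc mod
  loaded="$(awk '{ print $1 }')"
  rc=0
  for mod in "$@"; do
    if ! printf '%s\n' "$loaded" | grep -Fqx -- "$mod"; then
      echo "missing: $mod"
      rc=1
    fi
  done
  return "$rc"
}
```

A possible invocation is `lsmod | missing_modules nv_peer_mem ib_core mlx5_ib mlx5_core`, which prints nothing and returns 0 when all four are loaded.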

Checking storage physical disk resources and Multi-Path

Check the storage physical disk resources and Multi-Path to verify the IaaS HW level.

root@bm-dev-002:/tmp# fdisk -l
Code block. Checking storage physical disk resources
root@bm-dev-002:/tmp# multipath -ll
Code block. Checking Multi-Path

Checking the Service Network after a new Multi-node GPU Cluster deployment

Use the following command to check that the MII Status of each bonding Slave Interface is up.

  • Command

    root@mngc-001:~# cat /proc/net/bonding/bond-srv
    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Code block. Service Network check command

  • Result

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
    Code block. Service Network check result

Note
If any Slave Interface is in the down state, submit an inquiry through the Support Center to report the problem and have it resolved.
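
The bonding status file lists the bond's own MII Status first and then one MII Status per Slave Interface, so only the lines that follow a Slave Interface entry matter for this check. The sketch below applies that rule; check_bond_slaves is a hypothetical name, and the parser assumes the /proc/net/bonding layout shown above.

```shell
# Hypothetical helper: read a /proc/net/bonding/<bond> file from stdin,
# print "<slave> <status>" for each slave, and return non-zero if any
# slave is not "up". The bond-level MII Status line is ignored because
# it appears before the first "Slave Interface:" line.
check_bond_slaves() {
  awk '
    $1 == "Slave" && $2 == "Interface:" { slave = $3; next }
    $1 == "MII" && $2 == "Status:" && slave != "" {
      print slave, $3
      if ($3 != "up") bad = 1
      slave = ""
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

It could be used as `check_bond_slaves < /proc/net/bonding/bond-srv`, with a non-zero exit status indicating that at least one slave is down.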

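Checking time synchronization with the Time Server after a new Multi-node GPU Cluster deployment

The OS image comes with the chrony daemon installed and configured to synchronize with the SCP NTP servers. Use the following command to check that a line marked ^* appears in the MS Name column.

  • Command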

    root@mngc-001:~# chronyc sources -V
    Code block. chrony source check command

  • Result

    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^+ 198.19.1.53                   4  10   377  1040    -16us[  -37us] +/- 9982us
    ^* 198.19.1.54                   4  10   377   312   -367us[ -388us] +/-   13ms
    Code block. chrony source check result
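
In the result above, the ^* marker identifies the source that chronyd is currently synchronized to. As a small sketch, the check can be reduced to looking for a line whose first column is exactly ^*. The function name selected_time_source is hypothetical.

```shell
# Hypothetical helper: read `chronyc sources` output from stdin, print the
# address of the currently selected source (the "^*" line), and return
# non-zero when no source is selected.
selected_time_source() {
  awk '
    $1 == "^*" { print $2; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

With the sample output above on stdin, it would print 198.19.1.54 and return 0.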

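GPU MIG/ECC settings reset check guide

The GPU MIG/ECC settings are reset when you apply for the Multi-node GPU Cluster product. To make sure the settings take effect, reboot once before first use, then follow this check guide to verify that they have been applied.

Note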
  • MIG: Multi-Instance GPU
  • ECC: Error Correction Code

Resetting MIG settings

See the following for how to check and reset the MIG settings.

Use the following command to check that the MIG M. value is Disabled.

  • Command

    root@bm-dev-001:~# nvidia-smi
    Code block. Checking the MIG M. setting

  • Result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    58W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code block. MIG M. check result

  • If the MIG M. value is not Disabled, use the following commands to reset MIG.

    root@bm-dev-001:~# nvidia-smi -mig 0
    root@bm-dev-001:~# nvidia-smi --gpu-reset
    Code block. Resetting the MIG M. value
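
The MIG M. column can also be checked in a script-friendly form. The sketch below assumes that the installed nvidia-smi supports the index and mig.mode.current query fields (you can confirm this with nvidia-smi --help-query-gpu) and that the CSV output has the form "0, Disabled". The function name mig_all_disabled is hypothetical.

```shell
# Hypothetical helper: read "index, mig.mode.current" CSV rows from stdin,
# print one line per GPU, and return non-zero if any GPU is not "Disabled".
# Assumed input format (one row per GPU): "0, Disabled"
mig_all_disabled() {
  awk -F', *' '
    { print "GPU " $1 ": MIG " $2; if ($2 != "Disabled") bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

Under those assumptions it could be fed with `nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | mig_all_disabled`.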

Resetting ECC settings

See the following for how to check and reset the ECC settings.

Use the following command to check that the Volatile Uncorr. ECC value is Off.

  • Command

    root@bm-dev-001:~# nvidia-smi
    Code block. ECC setting check command

  • Result

    +-----------------------------------------------------------------------------------------+
    |  NVIDIA-SMI 470.129.06        Driver version: 470.129.06        CUDA Version: 11.4      |
    |----------------------------------+-----------------------------+------------------------|
    |  GPU  Name        Persistence-M  |  Bus-Id             Disp.A  |  Volatile Uncorr. ECC  |
    |  Fan  Temp  Perf  Pwr:Usage/Cap  |               Memory-Usage  |  GPU-Util  Compute M.  |
    |                                  |                             |                MIG M.  |
    |==================================+=============================+========================|
    |    0  NVIDIA A100-SXM...    Off  |  00000000:03:00.0      Off  |                   Off  |
    |  N/A  29C     P0    57W  /  400W |          0MiB  /  81251MiB  |    0%         Default  |
    |                                  |                             |              Disabled  |
    +----------------------------------+-----------------------------+------------------------+
    |    0  NVIDIA A100-SXM...    Off  |  00000000:0C:00.0      Off  |                   Off  |
    |  N/A  30C     P0    61W  /  400W |          0MiB  /  81251MiB  |    18%        Default  |
    |                                  |                             |              Disabled  |
    +-----------------------------------------------------------------------------------------+
    Code block. ECC setting check result

  • If the Volatile Uncorr. ECC value is On*, reboot the server.

  • If the Volatile Uncorr. ECC value is neither On* nor Off, use the following command to reset ECC. After the reset completes, reboot and check that the value is Off.

root@bm-dev-001:~# nvidia-smi --ecc-config=0
Code block. Resetting the ECC value
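
The three cases above (Off, On*, anything else) form a small decision rule. As an illustration only, it can be written as the shell function below; the name ecc_action and the action labels it prints are hypothetical and exist only in this sketch.

```shell
# Hypothetical helper: map the displayed "Volatile Uncorr. ECC" value
# (first argument) to the action described in this guide.
ecc_action() {
  case "$1" in
    Off)   echo "ok" ;;                        # already reset, nothing to do
    'On*') echo "reboot" ;;                    # quoted so * is matched literally
    *)     echo "disable-ecc-then-reboot" ;;   # run nvidia-smi --ecc-config=0, then reboot
  esac
}
```

For example, `ecc_action 'On*'` prints reboot, while any value other than Off or On* maps to the reset command followed by a reboot.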