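Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster Service Scope
If an IaaS HW-level issue occurs in the Multi-node GPU Cluster service, you can get technical support by submitting an inquiry through the Support Center. However, risks arising from changes such as OS kernel updates or application installation are the user's responsibility, and technical support for them is limited. Take care when performing system updates and similar work.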
IaaS HW-Level Issues
- HW fault event messages for the server reported in the IPMI HW monitoring console
- GPU HW operation errors reported by the nvidia-smi command
- HW error messages found when inspecting the InfiniBand HCA card or InfiniBand switch
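IaaS HW Inspection Guide
After applying for the Multi-node GPU Cluster service, we recommend inspecting the IaaS HW level according to this inspection guide.
Intel E810 Driver Update
Check the Intel E810 driver version and, if necessary, update it by following the procedure below.
- Minimum Intel E810 driver version recommended by the server manufacturer: 1.15.4 or later
- Driver download: Intel Network Adapter Driver for 800 Series Devices under Linux
Use the lspci command to check whether an E810 NIC device is present.
If an E810 NIC is present, the PCIe devices are listed as E810-C, as shown below. If there is no E810 NIC device, skip this task.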
lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
To update the driver, follow these steps.
Move the base driver tar file to a directory of your choice.
/usr/local/src
Directory example
Untar/unzip the archive file.
- x.x.x is the version number of the driver tar file.
tar zxf ice-x.x.x.tar.gz
Untar example
Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
cd ice-x.x.x/src/
Change to src directory example
Compile and install the driver module.
make install
Driver module compile example
After the update is complete, check the version.
lsmod | grep ice
modinfo ice | grep version
Version check example
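As a rough aid for the version check, the following sketch compares a driver version string with the recommended minimum of 1.15.4. The helper name version_ge is hypothetical, and the sketch assumes GNU sort with the -V (version sort) option is available and that `modinfo -F version ice` prints only the version string on your system.

```shell
# version_ge A B: succeeds when dotted version A is greater than or equal to B.
# Assumption: GNU sort with -V (version sort) is installed.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage on a node with the ice driver loaded:
#   if version_ge "$(modinfo -F version ice)" 1.15.4; then
#     echo "ice driver meets the recommended minimum"
#   else
#     echo "ice driver is older than 1.15.4; consider updating"
#   fi
```

The comparison is done on version components rather than as plain strings, so 1.9.x is correctly treated as older than 1.15.4.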
NVIDIA Driver Check
To inspect the IaaS HW level by checking the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status), follow these steps.
Check the GPU driver status.
~$ nvidia-smi
GPU driver status check command example
Thu Jan 29 14:48:31 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
| N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
| N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
| N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
| N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
| N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
| N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
GPU driver status example
Check the NVSwitch and NVLink HW status.
NVSwitch status check
~$ nvidia-smi nvlink --status
NVSwitch status check example
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
NVSwitch status example
NVLink HW status check
~$ nvidia-smi topo -m
NVLink HW status check command example
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127           0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127           0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127           0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127           0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127           0-7             N/A
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NVLink HW status check result example
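For a node with many GPUs, scanning the NVLink status output by eye is error-prone. The sketch below counts "Link N:" lines that do not report a speed in GB/s. The helper name count_bad_nvlinks is hypothetical, and the sketch assumes that a healthy link is always printed with a "GB/s" value, as in the example output above.

```shell
# Reads `nvidia-smi nvlink --status` output on stdin and prints the number of
# "Link N:" lines that do not report a speed in GB/s.
# Assumption: a healthy link line always contains "GB/s".
count_bad_nvlinks() {
  awk '/^[ \t]*Link [0-9]+:/ && !/GB\/s/ { bad++ } END { print bad + 0 }'
}

# Hypothetical usage on a GPU node:
#   nvidia-smi nvlink --status | count_bad_nvlinks
```

A result of 0 means every listed link reported a speed. Any other value is a reason to read the full output.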
Check the InfiniBand (IB) HCA card HW status and link.
user@bm-dev-001:~$ ibdev2netdev -v
HW status check command example
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$
HW status check result example
root@bm-dev-001:~# ibstat
Link check command example
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff5060ac
    System image GUID: 0x88e9a4ffff5060ac
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 8
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff5060ac
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504080
    System image GUID: 0x88e9a4ffff504080
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 5
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504080
        Link layer: InfiniBand
CA 'mlx5_2'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff505038
    System image GUID: 0x88e9a4ffff505038
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 2
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff505038
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4123
    Number of ports: 1
    Firmware version: 20.29.1016
    Hardware version: 0
    Node GUID: 0x88e9a4ffff504094
    System image GUID: 0x88e9a4ffff504094
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
        Base lid: 7
        LMC: 0
        SM lid: 1
        Capability mask: 0x2651e848
        Port GUID: 0x88e9a4ffff504094
        Link layer: InfiniBand
Link check result example
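The ibstat link check above can be sketched as a small filter that reports only the adapters needing attention. The helper name ib_ports_not_up is hypothetical. The sketch assumes the ibstat layout shown in the example: a "CA 'name'" header followed by "State:" and "Physical state:" lines.

```shell
# Reads `ibstat` output on stdin and prints one line for each port attribute
# that is not in the expected state (State: Active, Physical state: LinkUp).
# Prints nothing when every port looks healthy.
ib_ports_not_up() {
  awk -v q="'" '
    /^CA /                  { ca = $2; gsub(q, "", ca) }   # strip quotes around the CA name
    /^[ \t]*State:/          && $2 != "Active" { print ca ": state " $2 }
    /^[ \t]*Physical state:/ && $3 != "LinkUp" { print ca ": physical state " $3 }
  '
}

# Hypothetical usage:
#   ibstat | ib_ports_not_up
```

Empty output corresponds to the healthy state in the example, where all four mlx5 adapters report Active and LinkUp.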
IB Bandwidth Communication Check
To inspect the IaaS HW level by checking IB bandwidth communication (ib_send_bw), follow these steps.
Check the names of the IB HCA interfaces.
- IB ports in the following example: mlx5_0, mlx5_4, mlx5_5, mlx5_8
~$ ibdev2netdev -v
IB HCA interface name check example
0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
0000:4d:00.0 mlx5_3 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp77s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
0000:9c:00.0 mlx5_5 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
0000:9d:00.0 mlx5_6 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:bc:00.0 mlx5_7 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp188s0 (Down)
0000:cc:00.0 mlx5_8 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
0000:dc:00.0 mlx5_9 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp220s0 (Down)
IB HCA interface name check result example
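Picking the IB ports out of the ibdev2netdev listing can be sketched as a one-line filter. The helper name active_ib_devices is hypothetical. The sketch assumes, based only on the example output, that IB network interfaces have names beginning with "ib" and that usable ports are marked (ACTIVE); adapters bound to other interfaces such as bond-nas are skipped.

```shell
# Reads `ibdev2netdev -v` output on stdin and prints the device name (second
# column) of every port that is ACTIVE and mapped to a netdev starting with "ib".
# Assumption: IB netdev names start with "ib", as in the example output.
active_ib_devices() {
  awk '/\(ACTIVE\) ==> ib/ { print $2 }'
}

# Hypothetical usage:
#   ibdev2netdev -v 2>/dev/null | active_ib_devices
```

Applied to the example listing, this filter selects the same four devices named as IB ports in the text.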
Use the SERVER side command to check mutual communication.
~$ ib_send_bw -d mlx5_0 -i 1 -F
SERVER side command example
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
 remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40             0.317238
---------------------------------------------------------------------------------------
Mutual communication check result example
Use the CLIENT side command to check mutual communication.
~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
CLIENT side command example
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
 remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49            19006.37             0.304102
---------------------------------------------------------------------------------------
Mutual communication check result example
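To turn the ib_send_bw result into a pass/fail check, the "BW average[MB/sec]" value can be extracted and compared with a threshold. The following is a minimal sketch: the helper names bw_average and bw_at_least are hypothetical, and the threshold of 15000 in the usage comment is a placeholder and not a value from this guide. Choose a threshold that matches the link rate of your adapters.

```shell
# Reads ib_send_bw output on stdin and prints the "BW average[MB/sec]" column
# of each result row (a row is five fields beginning with two integers).
bw_average() {
  awk 'NF == 5 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $4 }'
}

# bw_at_least GOT MIN: succeeds when GOT >= MIN (floating-point comparison).
bw_at_least() {
  awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}

# Hypothetical usage on the CLIENT side (15000 is a placeholder threshold):
#   avg=$(ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP> | bw_average)
#   bw_at_least "$avg" 15000 && echo "IB bandwidth OK: $avg MB/sec"
```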
IB Service-Related Kernel Module Check
Inspect the IaaS HW level by checking the IB service-related kernel modules (lsmod).
IB service-related kernel module check - nvidia_peermem
~$ lsmod | grep nvidia_peermem
nvidia_peermem check command example
nvidia_peermem         16384  0
ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
nvidia_peermem check result example
IB service-related kernel module check - IB Card (HCA) driver
~$ lsmod | egrep 'ib_|_ib|mlx'
IB Card (HCA) driver check command example
mlx5_ib               393216  0
ib_uverbs             163840  2 irdma,mlx5_ib
ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
mlx5_core            1593344  1 mlx5_ib
mlxfw                  32768  1 mlx5_core
psample                20480  1 mlx5_core
tls                   114688  1 mlx5_core
pci_hyperv_intf        16384  1 mlx5_core
ib_ipoib              139264  0
ib_cm                 131072  2 rdma_cm,ib_ipoib
ib_umad                40960  0
mlx5_ib               454656  0
ib_uverbs             135168  2 rdma_ucm,mlx5_ib
ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
mlx5_core            2064384  1 mlx5_ib
mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
IB Card (HCA) driver check result example
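Instead of reading the grep output line by line, the module check can be sketched as a helper that lists which expected modules are absent. The helper name missing_modules is hypothetical, and the module list in the usage comment is taken from the example output above. It is an assumption that these are the modules your workload needs.

```shell
# Usage: lsmod | missing_modules MODULE...
# Reads `lsmod` output on stdin and prints every named module that is not loaded.
missing_modules() {
  loaded=$(awk '{ print $1 }')            # first column holds the module names
  for m in "$@"; do
    printf '%s\n' "$loaded" | grep -qxF "$m" || printf '%s\n' "$m"
  done
}

# Hypothetical usage:
#   lsmod | missing_modules nvidia_peermem mlx5_ib mlx5_core ib_core ib_uverbs
```

Matching is on the exact module name, so a module that appears only in the "Used by" column of another module is not counted as loaded.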
Storage Physical Disk Resources and Multi-Path Check
Inspect the IaaS HW level by checking the storage physical disk resources and Multi-Path.
Storage physical disk resource check
root@bm-dev-002:/tmp# fdisk -l
Storage physical disk resource check example
Multi-Path check
root@bm-dev-002:/tmp# multipath -ll
Multi-Path check example
Service Network Check After a New Multi-node GPU Cluster Deployment
Use the following command to verify that the MII Status of the bonding interface and of each slave interface is up.
Service Network check command
~$ cat /proc/net/bonding/bond-srv
Service Network check command example
Service Network check result
Ethernet Channel Bonding Driver: v5.15.0-25-generic
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0
Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
Service Network check result example
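The MII Status check can be sketched as a filter over the bonding status file. The helper name bond_links_down is hypothetical. The sketch assumes the file layout shown in the example, where the first "MII Status:" line belongs to the bond itself and each later one follows a "Slave Interface:" line.

```shell
# bond_links_down FILE: prints "bond" or the slave interface name for every
# "MII Status:" line in FILE whose value is not "up". Prints nothing when all
# links are up.
bond_links_down() {
  awk '
    /^Slave Interface:/ { ifname = $3 }
    /^MII Status:/ && $3 != "up" { print (ifname == "" ? "bond" : ifname) }
  ' "$1"
}

# Hypothetical usage:
#   bond_links_down /proc/net/bonding/bond-srv
```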
Time Synchronization Check with the Time Server After a New Multi-node GPU Cluster Deployment
The OS image comes with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that there is a line marked ^* in the MS Name column.
chrony daemon source and synchronization status check command
~$ chronyc sources -V
Status check command example
chrony daemon status check result
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
chrony daemon status check result example
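The check for a line marked ^* can be sketched as a small test over the chronyc output. The helper name chrony_synced is hypothetical, and the sketch looks only for a line that begins with ^*, as described in the text above.

```shell
# Reads `chronyc sources` output on stdin and succeeds when at least one line
# starts with "^*", the marker described above for the synchronized source.
chrony_synced() {
  grep -q '^\^\*'
}

# Hypothetical usage:
#   chronyc sources | chrony_synced && echo "time source selected" \
#     || echo "no synchronized time source"
```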