Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster Service Scope
For IaaS HW-level issues with the Multi-node GPU Cluster service, you can receive technical support through the Support Center's contact form. However, risks arising from changes such as OS kernel updates or application installations fall within the user's responsibility and are not eligible for technical support, so exercise caution when performing system updates or similar operations.
IaaS HW-level issues
- HW fault event messages for the server reported by the IPMI HW monitoring console
- GPU HW operation errors reported by the nvidia-smi command
- HW error messages produced when inspecting the InfiniBand HCA card or InfiniBand Switch
IaaS HW Inspection Guide
After requesting the Multi-node GPU Cluster service, we recommend inspecting the IaaS HW level by following this guide.
Intel E810 Driver Update
Check the Intel E810 driver version and update it if necessary using the following procedure.
- Minimum recommended server-manufacturer Intel E810 driver version: 1.15.4 or later
- Driver download: Intel Network Adapter Driver for 800 Series Devices under Linux
You can check for the presence of an E810 NIC device with the lspci command.
If an E810 NIC is present, the PCIe devices are listed with E810-C information as shown below (skip this step if no E810 NIC device exists).
lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
To update the driver, follow these steps.
- Move the base driver tar file to a directory of your choice.
Example: /usr/local/src
- Untar/unzip the archive file.
- x.x.x is the version number of the driver tar file.
tar zxf ice-x.x.x.tar.gz
- Change to the driver src directory.
- x.x.x is the version number of the driver tar file.
cd ice-x.x.x/src/
- Compile and install the driver module.
make install
- After the update completes, verify the version.
lsmod | grep ice
modinfo ice | grep version
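The comparison against the 1.15.4 minimum can be scripted. The `check_ice_version` helper below is a hypothetical sketch, not a vendor tool; it assumes GNU `sort -V` is available for dotted-version ordering, and takes the version string reported by `modinfo ice`.

```shell
# Hypothetical helper: compare a loaded ice driver version (as printed by
# `modinfo ice | grep '^version'`) against the recommended minimum 1.15.4.
version_ge() {
  # exit 0 when $1 >= $2 in dotted-version order (relies on GNU sort -V)
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

check_ice_version() {
  if version_ge "$1" "1.15.4"; then
    echo "OK"
  else
    echo "UPDATE REQUIRED"
  fi
}

check_ice_version "1.15.4"   # prints OK
check_ice_version "1.12.7"   # prints UPDATE REQUIRED
```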
NVIDIA Driver Check
To inspect the IaaS HW level by checking the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status), follow these steps.
- Check the GPU driver status.
~$ nvidia-smi
Thu Jan 29 14:48:31 2026
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:00:05.0 Off | On |
| N/A 36C P0 52W / 400W | 0MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:00:06.0 Off | 0 |
| N/A 36C P0 61W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:00:07.0 Off | 0 |
| N/A 36C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:00:08.0 Off | 0 |
| N/A 40C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:00:09.0 Off | 0 |
| N/A 36C P0 63W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:00:0A.0 Off | 0 |
| N/A 40C P0 64W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:00:0B.0 Off | 0 |
| N/A 39C P0 65W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:00:0C.0 Off | 0 |
| N/A 39C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| No MIG devices found |
+---------------------------------------------------------------------------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
- Check the NVSwitch and NVLink HW status.
- Check the NVLink link status
~$ nvidia-smi nvlink --status
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
- Check the NVLink topology
~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 0-127 0-7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 0-127 0-7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 0-127 0-7 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 0-127 0-7 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
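When all 12 links report a bandwidth, the links are healthy; a failed link typically shows an `<inactive>` marker instead of a speed (confirm the exact marker against your driver version). The helper below is a sketch that scans captured `nvidia-smi nvlink --status` output rather than calling the tool directly.

```shell
# Sketch: count NVLink entries that report "<inactive>" rather than a speed
# in captured `nvidia-smi nvlink --status` output read from stdin.
count_inactive_links() {
  # grep -c still prints 0 on no match but exits nonzero; || true keeps exit 0
  grep -c '<inactive>' || true
}

sample='GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxx)
	 Link 0: 25 GB/s
	 Link 1: <inactive>
	 Link 2: 25 GB/s'

printf '%s\n' "$sample" | count_inactive_links   # prints 1
```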
- Check the InfiniBand (IB) HCA card HW status and link.
user@bm-dev-001:~$ ibdev2netdev -v
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$
root@bm-dev-001:~# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.1016
Hardware version: 0
Node GUID: 0x88e9a4ffff5060ac
System image GUID: 0x88e9a4ffff5060ac
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 8
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x88e9a4ffff5060ac
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.1016
Hardware version: 0
Node GUID: 0x88e9a4ffff504080
System image GUID: 0x88e9a4ffff504080
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 5
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x88e9a4ffff504080
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.1016
Hardware version: 0
Node GUID: 0x88e9a4ffff505038
System image GUID: 0x88e9a4ffff505038
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x88e9a4ffff505038
Link layer: InfiniBand
CA 'mlx5_3'
CA type: MT4123
Number of ports: 1
Firmware version: 20.29.1016
Hardware version: 0
Node GUID: 0x88e9a4ffff504094
System image GUID: 0x88e9a4ffff504094
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 7
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x88e9a4ffff504094
Link layer: InfiniBand
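In the output above, a healthy port shows `State: Active` and `Physical state: LinkUp`. The awk sketch below counts ports in captured `ibstat` output that deviate from this; the function name is illustrative.

```shell
# Sketch: read `ibstat` output from stdin and print the number of ports whose
# State is not Active or whose Physical state is not LinkUp.
ibstat_bad_ports() {
  awk '
    /State:/ && !/Physical/ { if ($2 != "Active") bad++ }
    /Physical state:/       { if ($3 != "LinkUp") bad++ }
    END { print bad + 0 }
  '
}

printf 'State: Active\nPhysical state: LinkUp\nState: Down\nPhysical state: Polling\n' \
  | ibstat_bad_ports   # prints 2
```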
IB Bandwidth Communication Check
To inspect the IaaS HW level by checking IB bandwidth communication (ib_send_bw), follow these steps.
- Check the names of the IB HCA interfaces.
- IB ports in the example below: mlx5_0, mlx5_4, mlx5_5, mlx5_8
~$ ibdev2netdev -v
0000:1a:00.0 mlx5_0 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
0000:1b:00.0 mlx5_1 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:3c:00.0 mlx5_2 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp60s0 (Down)
0000:4d:00.0 mlx5_3 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp77s0 (Down)
0000:5e:00.0 mlx5_4 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
0000:9c:00.0 mlx5_5 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
0000:9d:00.0 mlx5_6 (MT4123 - 1028SN ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
0000:bc:00.0 mlx5_7 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp188s0 (Down)
0000:cc:00.0 mlx5_8 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
0000:dc:00.0 mlx5_9 (MT4129 - 1028SN ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN ) ==> ibp220s0 (Down)
- Run the SERVER-side command to check mutual communication.
~$ ib_send_bw -d mlx5_0 -i 1 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 0.00 19827.40 0.317238
---------------------------------------------------------------------------------------
- Run the CLIENT-side command to check mutual communication.
~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
RX depth : 512
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 19008.49 19006.37 0.304102
---------------------------------------------------------------------------------------
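The "BW average[MB/sec]" column (fourth field of the results row) can be compared against a site-specific floor; in the sample output the HDR link averages about 19,000 MB/s. The 18000 threshold below is an illustrative assumption, not a service guarantee.

```shell
# Sketch: read an ib_send_bw results row from stdin and compare the
# "BW average[MB/sec]" column (field 4) against a minimum passed as $1.
check_bw_average() {
  awk -v min="$1" '{ print (($4 >= min) ? "PASS" : "FAIL"); exit }'
}

echo "65536 1000 19008.49 19006.37 0.304102" | check_bw_average 18000   # prints PASS
```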
IB Service-Related Kernel Module Check
Inspect the IaaS HW level by checking the IB service-related kernel modules (lsmod).
- IB service-related kernel module check - nvidia_peermem
~$ lsmod | grep nvidia_peermem
nvidia_peermem 16384 0
ib_core 425984 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia 56524800 451 nvidia_uvm,nvidia_peermem,nvidia_modeset
- IB service-related kernel module check - IB card (HCA) driver
~$ lsmod | egrep 'ib_|_ib|mlx'
mlx5_ib 393216 0
ib_uverbs 163840 2 irdma,mlx5_ib
ib_core 393216 3 irdma,ib_uverbs,mlx5_ib
mlx5_core 1593344 1 mlx5_ib
mlxfw 32768 1 mlx5_core
psample 20480 1 mlx5_core
tls 114688 1 mlx5_core
pci_hyperv_intf 16384 1 mlx5_core
ib_ipoib 139264 0
ib_cm 131072 2 rdma_cm,ib_ipoib
ib_umad 40960 0
mlx5_ib 454656 0
ib_uverbs 135168 2 rdma_ucm,mlx5_ib
ib_core 434176 9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c 16384 5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
mlx5_core 2064384 1 mlx5_ib
mlx_compat 69632 11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
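The required modules can be verified in one pass. `missing_modules` below is a hypothetical helper operating on captured `lsmod` output; the module list is taken from this section.

```shell
# Sketch: $1 is a space-separated list of required modules; stdin is `lsmod`
# output. Prints the name of each module that is not loaded.
missing_modules() {
  # skip the "Module Size Used by" header, keep the first column
  loaded=$(awk 'NR > 1 { print $1 }')
  for m in $1; do
    printf '%s\n' "$loaded" | grep -qx "$m" || echo "$m"
  done
}

printf 'Module Size Used by\nnvidia_peermem 16384 0\nmlx5_ib 454656 0\n' \
  | missing_modules "nvidia_peermem mlx5_ib ib_core"   # prints ib_core
```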
Storage Physical Disk Resources and Multi-Path Check
Inspect the IaaS HW level by checking the storage physical disk resources and multi-path configuration.
- Physical disk resource check result
root@bm-dev-002:/tmp# fdisk -l
- Multi-path check result
root@bm-dev-002:/tmp# multipath -ll
Service Network Check After a New Multi-node GPU Cluster Deployment
Use the following command to verify that the MII Status of the bonding and slave interfaces is up.
- Command
~$ cat /proc/net/bonding/bond-srv
- Service network check result
Ethernet Channel Bonding Driver: v5.15.0-25-generic
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0
Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0
Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
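The MII Status lines (one for the bond plus one per slave interface) can be checked mechanically. The helper below is a sketch over captured /proc/net/bonding/bond-srv contents read from stdin.

```shell
# Sketch: read /proc/net/bonding/<bond> contents from stdin and count
# "MII Status:" lines that do not report "up".
mii_down_count() {
  awk '/MII Status:/ { if ($3 != "up") down++ } END { print down + 0 }'
}

printf 'MII Status: up\nMII Status: up\nMII Status: down\n' | mii_down_count   # prints 1
```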
Time Server Synchronization Check After a New Multi-node GPU Cluster Deployment
The OS image ships with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to verify that a line marked ^* appears in the MS Name column.
- Command to check the chrony daemon sources and synchronization status
~$ chronyc sources -V
- chrony daemon status check result
MS Name/IP address Stratum Poll Reach LastRx Last sample
===============================================================================
^* 198.19.0.54 4 10 377 1040 -16us[ -37us] +/- 9982us
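The presence of a ^* line (the source chrony has selected and synchronized to) can be tested directly. The helper below is a sketch over captured `chronyc sources` output read from stdin.

```shell
# Sketch: read `chronyc sources` output from stdin and report whether any
# source line starts with "^*" (selected and synchronized).
ntp_synced() {
  if grep -q '^\^\*'; then echo "synced"; else echo "NOT synced"; fi
}

printf '^* 198.19.0.54 4 10 377 1040 -16us\n' | ntp_synced   # prints synced
```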