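Multi-node GPU Cluster Service Scope and Inspection Guide

Multi-node GPU Cluster service scope

If an IaaS HW-level problem occurs in the Multi-node GPU Cluster service, you can get technical support by submitting an inquiry in the Support Center. However, risks that arise from changes such as OS kernel updates or application installs are the user's responsibility and are difficult to support, so take care when performing system updates and similar work.

IaaS HW-level problems

  • HW fault event messages for the server reported in the IPMI HW monitoring console
  • GPU HW operation errors reported by the nvidia-smi command
  • HW error messages found when inspecting the InfiniBand HCA card or InfiniBand switch
Caution
Multi-node GPU Cluster is sensitive to software version compatibility between Ubuntu OS, NVIDIA, and InfiniBand, so official technical support is not available after the user changes the system, for example by updating the OS kernel or installing applications.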

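IaaS HW inspection guide

After you apply for the Multi-node GPU Cluster service, we recommend inspecting the IaaS HW level by following this inspection guide.

Intel E810 driver update

Check the version of the Intel E810 driver and, if necessary, update it using the following procedure.

Caution
Perform the update only if the Multi-node GPU Cluster node has an Intel E810 device and also uses a standard image of version 535.86.10 or earlier.
Note

You can use the lspci command to check whether an E810 NIC device is present.
If an E810 NIC is present, its PCIe devices are listed with the E810-C label as shown below. If there is no E810 NIC device, skip this task.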

lspci | grep E810
0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
Example of checking for an E810 NIC device
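
The presence check above can also be scripted. The following is a minimal sketch, assuming the `lspci` output has already been captured as text; the function name `find_e810_devices` and the regular expression are illustrative and not part of the service tooling.

```python
import re

# Matches one `lspci` line for an Ethernet controller whose description
# contains "E810", and captures the PCI address at the start of the line.
E810_LINE = re.compile(
    r"^(?P<addr>[0-9a-fA-F:.]+)\s+Ethernet controller:.*\bE810\b",
    re.MULTILINE,
)

def find_e810_devices(lspci_output):
    """Return the PCI addresses of E810 devices found in `lspci` output."""
    return [m.group("addr") for m in E810_LINE.finditer(lspci_output)]

# Sample text copied from the lspci example in this guide.
sample = (
    "0000:6a:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)\n"
    "0000:6a:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)\n"
)

devices = find_e810_devices(sample)
if devices:
    print("E810 devices:", devices)
else:
    print("No E810 NIC found; skip the driver update.")
```

An empty result corresponds to the case in which the update task is skipped.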

To update the driver, follow these steps.

  1. Move the base driver tar file to a directory of your choice.

    /usr/local/src
    Example target directory

  2. Untar/unzip the archive file.

    • x.x.x is the version number of the driver tar file.
      tar zxf ice-x.x.x.tar.gz
      Untar example
  3. Change to the driver src directory.

    • x.x.x is the version number of the driver tar file.
      cd ice-x.x.x/src/
      Example of changing to the src directory
  4. Compile the driver module.

    make install
    Example of compiling the driver module

  5. After the update finishes, check the version.

    lsmod | grep ice
    modinfo ice | grep version
    Version check example
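
Note that `grep version` in the last step also matches the `srcversion:` field of the `modinfo` output. As a rough sketch, the driver version alone could be extracted as follows; the function name `driver_version` is hypothetical, and the sample values are placeholders rather than real output.

```python
from typing import Optional

def driver_version(modinfo_output: str) -> Optional[str]:
    """Return the value of the `version:` field, ignoring `srcversion:`."""
    for line in modinfo_output.splitlines():
        key, sep, value = line.partition(":")
        # Compare the whole key so that "srcversion" is not picked up.
        if sep and key.strip() == "version":
            return value.strip()
    return None

# Placeholder text in the shape of `modinfo ice | grep version` output.
# Both values are hypothetical; x.x.x stands for the driver version.
sample = (
    "version:        x.x.x\n"
    "srcversion:     0123456789ABCDEF\n"
)

print(driver_version(sample))
```

The returned string can then be compared with the version number of the tar file that was installed.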

NVIDIA driver check

To inspect the IaaS HW level by checking the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status), follow these steps.

Note
The example of how MIG is applied is described for an A100 GPU node.
  1. Check the GPU driver status.

    ~$ nvidia-smi
    Example command for checking the GPU driver status
    Thu Jan 29 14:48:31 2026
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                   On |
    | N/A   36C    P0              52W / 400W |      0MiB / 81920MiB |     N/A      Default |
    |                                         |                      |              Enabled |
    +-----------------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM4-80GB          On  | 00000000:00:06.0 Off |                    0 |
    | N/A   36C    P0              61W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   2  NVIDIA A100-SXM4-80GB          On  | 00000000:00:07.0 Off |                    0 |
    | N/A   36C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   3  NVIDIA A100-SXM4-80GB          On  | 00000000:00:08.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   4  NVIDIA A100-SXM4-80GB          On  | 00000000:00:09.0 Off |                    0 |
    | N/A   36C    P0              63W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   5  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0A.0 Off |                    0 |
    | N/A   40C    P0              64W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   6  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0B.0 Off |                    0 |
    | N/A   39C    P0              65W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    |   7  NVIDIA A100-SXM4-80GB          On  | 00000000:00:0C.0 Off |                    0 |
    | N/A   39C    P0              60W / 400W |      0MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | MIG devices:                                                                          |
    +------------------+--------------------------------+-----------+-----------------------+
    | GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
    |      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
    |                  |                                |        ECC|                       |
    |==================+================================+===========+=======================|
    |  No MIG devices found                                                                 |
    +---------------------------------------------------------------------------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+
    Example GPU driver status output

  2. Check the NVSwitch and NVLink HW status.

    • Check the NVSwitch status

      ~$ nvidia-smi nvlink --status
      Example command for checking the NVSwitch status
      GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
               Link 0: 25 GB/s
               Link 1: 25 GB/s
               Link 2: 25 GB/s
               Link 3: 25 GB/s
               Link 4: 25 GB/s
               Link 5: 25 GB/s
               Link 6: 25 GB/s
               Link 7: 25 GB/s
               Link 8: 25 GB/s
               Link 9: 25 GB/s
               Link 10: 25 GB/s
               Link 11: 25 GB/s
      Example NVSwitch status output

    • Check the NVLink HW status

      ~$ nvidia-smi topo -m
      Example command for checking the NVLink HW status
              GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
      GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU1    NV12	 X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
      GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
      GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
      GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
      GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A
      
      Legend:
      
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks
      Example NVLink HW status output

  3. Check the HW status and link of the InfiniBand (IB) HCA cards.

    user@bm-dev-001:~$ ibdev2netdev -v
    Example command for checking the HW status
    cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
    0000:45:00.0 mlx5_0 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
    cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
    0000:0e:00.0 mlx5_1 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
    cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
    0000:c5:00.0 mlx5_2 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
    cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
    0000:85:00.0 mlx5_3 (MT4123 -            )                 fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
    user@bm-dev-001:~$
    Example HW status output
    root@bm-dev-001:~# ibstat
    Example command for checking the link
    CA 'mlx5_0'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff5060ac
            System image GUID: 0x88e9a4ffff5060ac
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 8
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff5060ac
                    Link layer: InfiniBand
    CA 'mlx5_1'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504080
            System image GUID: 0x88e9a4ffff504080
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 5
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504080
                    Link layer: InfiniBand
    CA 'mlx5_2'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff505038
            System image GUID: 0x88e9a4ffff505038
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 2
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff505038
                    Link layer: InfiniBand
    CA 'mlx5_3'
            CA type: MT4123
            Number of ports: 1
            Firmware version: 20.29.1016
            Hardware version: 0
            Node GUID: 0x88e9a4ffff504094
            System image GUID: 0x88e9a4ffff504094
            Port 1:
                    State: Active
                    Physical state: LinkUp
                    Rate: 200
                    Base lid: 7
                    LMC: 0
                    SM lid: 1
                    Capability mask: 0x2651e848
                    Port GUID: 0x88e9a4ffff504094
                    Link layer: InfiniBand
    Example link check output
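
Reading these outputs by eye is error-prone on an 8-GPU node, so the checks in steps 2 and 3 could be automated. The sketch below assumes the `nvidia-smi nvlink --status` and `ibstat` outputs have been captured as text. The function names are hypothetical, the failing lines in the samples (`<inactive>`, `Down`, `Polling`) are illustrative placeholders, and the pass criteria (a bandwidth value per link, `Active` plus `LinkUp` per port) are inferred from the healthy examples above.

```python
import re

def inactive_nvlinks(nvlink_status):
    """Return (gpu, link) pairs whose status line does not report a bandwidth."""
    problems, gpu = [], None
    for line in nvlink_status.splitlines():
        gpu_match = re.match(r"\s*GPU (\d+):", line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            continue
        link_match = re.match(r"\s*Link (\d+):\s*(.*)$", line)
        # A healthy link looks like "Link 0: 25 GB/s"; anything else is flagged.
        if link_match and not re.fullmatch(r"[\d.]+ GB/s", link_match.group(2).strip()):
            problems.append((gpu, int(link_match.group(1))))
    return problems

def unhealthy_ib_ports(ibstat_output):
    """Return CA names that have a port which is not both Active and LinkUp."""
    bad, ca, state = [], None, None
    for line in ibstat_output.splitlines():
        text = line.strip()
        if text.startswith("CA '"):
            ca = text.split("'")[1]
        elif text.startswith("State:"):
            state = text.split(":", 1)[1].strip()
        elif text.startswith("Physical state:"):
            physical = text.split(":", 1)[1].strip()
            if (state, physical) != ("Active", "LinkUp"):
                bad.append(ca)
    return bad

# Shortened samples; the second link and the second CA are made-up failures.
nvlink_sample = (
    "GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)\n"
    "         Link 0: 25 GB/s\n"
    "         Link 1: <inactive>\n"
)
ibstat_sample = (
    "CA 'mlx5_0'\n"
    "        Port 1:\n"
    "                State: Active\n"
    "                Physical state: LinkUp\n"
    "CA 'mlx5_1'\n"
    "        Port 1:\n"
    "                State: Down\n"
    "                Physical state: Polling\n"
)

print("NVLink problems:", inactive_nvlinks(nvlink_sample))
print("IB port problems:", unhealthy_ib_ports(ibstat_sample))
```

Empty lists from both functions correspond to the healthy outputs shown in this guide.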

IB bandwidth communication check

To inspect the IaaS HW level by checking IB bandwidth communication (ib_send_bw), follow these steps.

  1. Check the names of the IB HCA interfaces.

    • In the following example, the IB ports are mlx5_0, mlx5_4, mlx5_5, and mlx5_8.
      ~$ ibdev2netdev -v
      Example of checking the IB HCA interface names
      0000:1a:00.0 mlx5_0 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp26s0 (Down)
      0000:1b:00.0 mlx5_1 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:3c:00.0 mlx5_2 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp60s0 (Down)
      0000:4d:00.0 mlx5_3 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp77s0 (Down)
      0000:5e:00.0 mlx5_4 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp94s0 (Down)
      0000:9c:00.0 mlx5_5 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp156s0 (Down)
      0000:9d:00.0 mlx5_6 (MT4123 - 1028SN     ) Mellanox ConnectX-6 Single Port VPI HDR QSFP Adapter fw 20.38.1002 port 1 (ACTIVE) ==> bond-nas (Up)
      0000:bc:00.0 mlx5_7 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp188s0 (Down)
      0000:cc:00.0 mlx5_8 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (ACTIVE) ==> ibp204s0 (Down)
      0000:dc:00.0 mlx5_9 (MT4129 - 1028SN     ) Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter fw 28.38.1002 port 1 (DOWN  ) ==> ibp220s0 (Down)
      Example output of the IB HCA interface name check
  2. Use the SERVER side command to check the mutual communication status.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F
    Example of the SERVER side command
    ************************************
    * Waiting for client to connect... *
    ************************************
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     TX depth        : 128
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x0f QPN 0x6d95 PSN 0xb974a
     remote address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             0.00               19827.40                   0.317238
    ---------------------------------------------------------------------------------------
    Example output of the mutual communication status check

  3. Use the CLIENT side command to check the mutual communication status.

    ~$ ib_send_bw -d mlx5_0 -i 1 -F <SERVER Side IP>
    Example of the CLIENT side command
    ---------------------------------------------------------------------------------------
                        Send BW Test
     Dual-port       : OFF		Device         : mlx5_0
     Number of qps   : 1		Transport type : IB
     Connection type : RC		Using SRQ      : OFF
     PCIe relax order: ON
     ibv_wr* API     : ON
     RX depth        : 512
     CQ Moderation   : 100
     Mtu             : 4096[B]
     Link type       : IB
     Max inline data : 0[B]
     rdma_cm QPs	 : OFF
     Data ex. method : Ethernet
    ---------------------------------------------------------------------------------------
     local address: LID 0x01 QPN 0x6dd2 PSN 0xc8a18c
     remote address: LID 0x0f QPN 0x6d95 PSN 0xb974a
    ---------------------------------------------------------------------------------------
     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
     65536      1000             19008.49            19006.37                  0.304102
    ---------------------------------------------------------------------------------------
    Example output of the mutual communication status check
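The steps above can be sketched as two small parsers: one that picks candidate IB devices from `ibdev2netdev -v` output, and one that reads the `BW average[MB/sec]` column from an `ib_send_bw` report. This is a rough sketch under stated assumptions: the line layouts match the examples above, an IB port is taken to be an `ACTIVE` device whose network interface name starts with `ib`, and the names `ib_devices` and `average_bandwidth` are hypothetical. No pass/fail bandwidth threshold is implied by the source.

```python
import re

# Matches e.g. "... mlx5_0 (...) ... port 1 (ACTIVE) ==> ibp26s0 (Down)"
_DEV_LINE = re.compile(r"(mlx5_\d+) .*port \d+ \((\w+)\s*\) ==> (\S+)")


def ib_devices(ibdev2netdev_text):
    """Return ACTIVE devices mapped to an interface whose name starts with 'ib'."""
    devices = []
    for line in ibdev2netdev_text.splitlines():
        m = _DEV_LINE.search(line)
        if m and m.group(2) == "ACTIVE" and m.group(3).startswith("ib"):
            devices.append(m.group(1))
    return devices


def average_bandwidth(ib_send_bw_text):
    """Return the BW average[MB/sec] value of the first result row, or None."""
    for line in ib_send_bw_text.splitlines():
        fields = line.split()
        # A result row has five numeric columns and starts with the message size.
        if len(fields) == 5 and fields[0].isdigit():
            return float(fields[3])
    return None
```

Running `ib_devices` on the interface name example gives the device names to pass to `ib_send_bw -d`, and `average_bandwidth` can then be applied to the saved output from either side.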

Checking IB-related kernel modules

Check the IaaS HW level by verifying the IB-related kernel modules (lsmod).

  • Checking IB-related kernel modules: nvidia_peermem

    ~$ lsmod | grep nvidia_peermem
    Example of the nvidia_peermem check command
    nvidia_peermem         16384  0
    ib_core               425984  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    nvidia              56524800  451 nvidia_uvm,nvidia_peermem,nvidia_modeset
    Example output of the nvidia_peermem check

  • Checking IB-related kernel modules: IB Card (HCA) driver

    ~$ lsmod | egrep 'ib_|_ib|mlx'
    Example of the IB Card (HCA) driver check command
    mlx5_ib               393216  0
    ib_uverbs             163840  2 irdma,mlx5_ib
    ib_core               393216  3 irdma,ib_uverbs,mlx5_ib
    mlx5_core            1593344  1 mlx5_ib
    mlxfw                  32768  1 mlx5_core
    psample                20480  1 mlx5_core
    tls                   114688  1 mlx5_core
    pci_hyperv_intf        16384  1 mlx5_core
    ib_ipoib              139264  0
    ib_cm                 131072  2 rdma_cm,ib_ipoib
    ib_umad                40960  0
    mlx5_ib               454656  0
    ib_uverbs             135168  2 rdma_ucm,mlx5_ib
    ib_core               434176  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    libcrc32c              16384  5 nf_conntrack,nf_nat,btrfs,nf_tables,raid456
    mlx5_core            2064384  1 mlx5_ib
    mlx_compat             69632  11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
    Example output of the IB Card (HCA) driver check
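The two module checks above can be combined into one pass over the full `lsmod` output. The sketch below is illustrative only: the default `required` tuple is an assumption built from module names that appear in the example outputs, not an official list, and `missing_modules` is a hypothetical helper name.

```python
def missing_modules(
    lsmod_text,
    required=("nvidia_peermem", "mlx5_ib", "mlx5_core", "ib_core"),
):
    """Return the required module names that do not appear in lsmod output."""
    # The first column of every lsmod line is the module name.
    loaded = {line.split()[0] for line in lsmod_text.splitlines() if line.split()}
    return [name for name in required if name not in loaded]
```

An empty list means every name in `required` was found in the first column; adjust `required` to match the modules expected on the image in use.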

Checking storage physical disk resources and Multi-Path

Check the IaaS HW level by verifying the storage physical disk resources and Multi-Path.

  • Checking storage physical disk resources

    root@bm-dev-002:/tmp# fdisk -l
    Example of the storage physical disk resource check command

  • Checking Multi-Path

    root@bm-dev-002:/tmp# multipath -ll
    Example of the Multi-Path check command

Checking the Service Network after a new Multi-node GPU Cluster deployment

Use the following command to check that the MII Status of the Bonding and of each Slave Interface is up.

  • Service Network check command

    ~$ cat /proc/net/bonding/bond-srv
    Example of the Service Network check command

  • Service Network check output

    Ethernet Channel Bonding Driver: v5.15.0-25-generic
    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: ens9f0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 0
    Down Delay (ms): 0
    Peer Notification Delay (ms): 0
    
    Slave Interface: ens9f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:35:70
    Slave queue ID: 0
    
    Slave Interface: ens11f0
    MII Status: up
    Speed: 100000 Mbps
    Duplex: full
    Link Failure Count: 0
    Permanent HW addr: 30:3e:a7:02:2f:e8
    Slave queue ID: 0
    Example output of the Service Network check

Note
If any Slave Interface is in the down state, report the issue through the Support Center inquiry form and request corrective action.
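The check above can be sketched as a small parser over the contents of `/proc/net/bonding/bond-srv`. This is a minimal sketch, assuming the file follows the `Slave Interface:` / `MII Status:` layout shown in the example; `mii_status` and `interfaces_down` are hypothetical helper names, and the key `"bond"` is an arbitrary label for the bond-level status.

```python
def mii_status(bonding_text):
    """Map 'bond' and each slave interface name to its MII Status value."""
    status = {}
    name = "bond"  # The first MII Status line belongs to the bond itself.
    for raw in bonding_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        value = value.strip()
        if key == "Slave Interface":
            name = value
        elif key == "MII Status":
            status[name] = value
    return status


def interfaces_down(bonding_text):
    """Return the names whose MII Status is anything other than 'up'."""
    return [n for n, s in mii_status(bonding_text).items() if s != "up"]
```

A non-empty result from `interfaces_down` corresponds to the situation described in the note and is worth reporting to the Support Center.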

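Checking time synchronization with the Time Server after a new Multi-node GPU Cluster deployment

The OS image ships with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check whether a line marked ^* appears in the MS Name column.

  • Command for checking chrony daemon sources and synchronization status

    ~$ chronyc sources -V
    Example of the status check command

  • chrony daemon status check output
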
    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    ^* 198.19.0.54                   4  10   377  1040    -16us[  -37us] +/- 9982us
    Example output of the chrony daemon status check
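Looking for the `^*` marker can be automated with a few lines. The sketch below assumes the output layout shown in the example, where the first column holds the mode and state characters and the second holds the source name or address; `synced_source` is a hypothetical helper name.

```python
def synced_source(chronyc_text):
    """Return the address of the source marked '^*', or None if there is none."""
    for line in chronyc_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "^*":
            return fields[1]
    return None
```

A return value of `None` means no source is currently selected for synchronization in the given text.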
