Using Multi-instance GPU on a GPU Server

After you create a GPU Server, you can enable the MIG (Multi-instance GPU) feature in the GPU Server's VM (Guest OS) and create Instances to use.

Overview of Multi-instance GPU (NVIDIA A100)

The NVIDIA A100 is a GPU based on the NVIDIA Ampere architecture that supports Multi-instance GPU (MIG). It can be securely partitioned into up to seven independent GPU Instances, each of which can run CUDA (Compute Unified Device Architecture) applications. By allocating compute resources in a way optimized for GPU use while making use of high bandwidth memory (HBM) and cache, the NVIDIA A100 can provide independent GPU resources to multiple users. Workloads that do not reach the GPU's maximum compute capacity can run in parallel on separate instances, which maximizes GPU utilization.

Figure. Multi-instance GPU architecture

Using the Multi-instance GPU feature

To use the Multi-instance GPU feature, create the GPU Server service on Samsung Cloud Platform and then create a VM Instance (Guest OS) with an A100 GPU assigned. Once the GPU Server has been created, you can follow the MIG setup sequence and the MIG teardown sequence below.

Figure. Creating a Multi-instance GPU

MIG setup sequence
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG
MIG teardown sequence
Delete Compute Instance → Delete GPU Instance → Disable MIG
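
The two sequences above can be sketched as ordered lists of nvidia-smi invocations. This Python sketch only builds the argument lists and does not run anything. The function names are hypothetical, the GPU Instance and Compute Instance IDs are assumed to be 0 (the driver assigns the real IDs), and the teardown flags (`-dci`, `-dgi`, `-mig 0`) are the nvidia-smi counterparts of the creation flags, assumed here because this section does not show them.

```python
def mig_setup_commands(gpu_id, gi_profile_id, ci_profile_id, gi_id=0):
    """Enable MIG -> create GPU Instance -> create Compute Instance.

    A VM reboot is expected after the first command, as described in this guide.
    gi_id is an assumption: use the ID reported by `nvidia-smi mig -lgi`.
    """
    gpu = str(gpu_id)
    return [
        ["nvidia-smi", "-i", gpu, "-mig", "1"],
        ["nvidia-smi", "mig", "-i", gpu, "-cgi", str(gi_profile_id)],
        ["nvidia-smi", "mig", "-i", gpu, "-gi", str(gi_id),
         "-cci", str(ci_profile_id)],
    ]


def mig_teardown_commands(gpu_id, gi_id=0, ci_id=0):
    """Delete Compute Instance -> delete GPU Instance -> disable MIG."""
    gpu = str(gpu_id)
    return [
        ["nvidia-smi", "mig", "-i", gpu, "-gi", str(gi_id),
         "-ci", str(ci_id), "-dci"],
        ["nvidia-smi", "mig", "-i", gpu, "-gi", str(gi_id), "-dgi"],
        ["nvidia-smi", "-i", gpu, "-mig", "0"],
    ]


if __name__ == "__main__":
    # Profile IDs 0 (GPU Instance) and 4 (Compute Instance) match the
    # 7g.80gb examples used later in this guide.
    for cmd in mig_setup_commands(0, 0, 4) + mig_teardown_commands(0):
        print(" ".join(cmd))
```

Keeping the commands as plain lists makes the ordering easy to review before anything is executed on the VM.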

Note
  • The system requirements for using the MIG feature are as follows (see NVIDIA - Supported GPUs).
    • CUDA toolkit 11 and NVIDIA driver 450.80.02 or later
    • A Linux distribution that supports CUDA toolkit 11
  • The requirements for using the MIG feature with containers or Kubernetes are as follows.
    • NVIDIA Container Toolkit (nvidia-docker2) v2.5.0 or later
    • NVIDIA K8s Device Plugin v0.7.0 or later
    • NVIDIA gpu-feature-discovery v0.2.0 or later
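
The minimum driver version listed above can be checked mechanically. The following is a minimal sketch, assuming the driver version is a dotted numeric string such as the `470.57.02` shown in the nvidia-smi banner later in this guide. The helper names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version such as '450.80.02' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum driver version stated in the requirements list.
MIN_MIG_DRIVER = parse_version("450.80.02")


def driver_supports_mig(driver_version):
    """Return True if the driver version meets the stated minimum."""
    return parse_version(driver_version) >= MIN_MIG_DRIVER


if __name__ == "__main__":
    # 470.57.02 is the driver version that appears in the sample output.
    print(driver_supports_mig("470.57.02"))
```

Comparing integer tuples avoids the pitfalls of string comparison, where "450.9" would sort after "450.80".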

Enabling and using MIG

To enable MIG, create Instances, and assign work to them, follow the procedure below.

MIG setup sequence
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG

Enabling MIG

  1. Before applying MIG, check the GPU state in the VM Instance (Guest OS).

    • Verify that MIG mode is Disabled.
      $ nvidia-smi
      Mon Sep 27 08:37:08 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
      | N/A   32C   P0    59W / 400W  |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - checking the MIG-disabled state (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - checking the MIG-disabled state (2)
  2. In the VM Instance (Guest OS), enable MIG for each GPU, then reboot the VM Instance.

    $ nvidia-smi -i 0 -mig 1
    Enabled MIG mode for GPU 00000000:05:00.0
    All done.
    
    # reboot
    Code block. nvidia-smi command - enabling MIG

Note

If the GPU monitoring agent displays the following warning message, stop the nvsm and dcgm services before enabling MIG.

Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).

# systemctl stop nvsm
# systemctl stop dcgm
  • After you finish the MIG work, restart the nvsm and dcgm services.
  3. After applying MIG, check the GPU state in the VM Instance (Guest OS).
    • Verify that MIG mode is Enabled.
      $ nvidia-smi
      Mon Sep 27 09:44:33 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C   P0    59W / 400W  |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
      |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                   |                     |        ECC|                       |
      |=============================================================================|
      | No MIG devices found                                                        |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - checking the MIG-enabled state (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - checking the MIG-enabled state (2)
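
The Disabled/Enabled check in the steps above can also be done without reading the full table. The sketch below assumes the `mig.mode.current` and `mig.mode.pending` fields of `nvidia-smi --query-gpu` and a CSV output such as `0, Enabled, Enabled`. The sample line and the function names are illustrative assumptions, not output captured from this system.

```python
import subprocess

# Query fields are assumed to be available in the installed nvidia-smi.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current,mig.mode.pending",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Map GPU index -> (current MIG mode, pending MIG mode)."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, current, pending = [field.strip() for field in line.split(",")]
        modes[int(index)] = (current, pending)
    return modes


def read_mig_modes():
    """Run nvidia-smi on the VM and parse its answer (requires a GPU)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    # Parsing an assumed sample line instead of calling nvidia-smi.
    print(parse_mig_modes("0, Enabled, Enabled\n"))
```

A current mode that differs from the pending mode would correspond to the "pending enable state" mentioned in the warning message above.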

Creating a GPU Instance

Once you have enabled MIG and verified its state, you can create a GPU Instance.

  1. Check the list of MIG GPU Instance profiles that can be created.

    $ nvidia-smi mig -i [GPU ID] -lgip
    Code block. nvidia-smi command - listing MIG GPU Instance profiles

    $ nvidia-smi mig -i 0 -lgip
    +-----------------------------------------------------------------------------+
    | GPU instance profiles:                                                      |
    | GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
    |                              Free/Total   GiB              CE    JPEG  OFA  |
    |=============================================================================|
    |   0 MIG 1g.10gb        19    7/7         9.50       No     14     0     0   |
    |                                                             1     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 1g.10gb+me     20    1/1         9.50       No     14     0     0   |
    |                                                             1     1     1   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 2g.20gb        14    3/3         19.50      No     28     1     0   |
    |                                                             2     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 3g.40gb         9    2/2         39.50      No     42     2     0   |
    |                                                             3     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 4g.40gb         5    1/1         39.50      No     56     2     0   |
    |                                                             4     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 7g.80gb         0    1/1         79.25      No     98     0     0   |
    |                                                             7     1     1   |
    +-----------------------------------------------------------------------------+
    Code block. MIG GPU Instance profile list
Note
For A100 GPU Instance profiles, refer to the example in the NVIDIA A100 MIG Profile table.
Figure. MIG Device Naming
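Profile Name   | Fraction of Memory | Fraction of SMs | Hardware Units              | L2 Cache Size | Number of Instances Available
---------------|--------------------|-----------------|-----------------------------|---------------|------------------------------
MIG 1g.10gb    | 1/8                | 1/7             | 0 NVDECs / 0 JPEG / 0 OFA   | 1/8           | 7
MIG 1g.10gb+me | 1/8                | 1/7             | 1 NVDEC / 1 JPEG / 1 OFA    | 1/8           | 1 (A single 1g profile can include media extensions)
MIG 2g.20gb    | 2/8                | 2/7             | 1 NVDECs / 0 JPEG / 0 OFA   | 2/8           | 3
MIG 3g.40gb    | 4/8                | 3/7             | 2 NVDECs / 0 JPEG / 0 OFA   | 4/8           | 2
MIG 4g.40gb    | 4/8                | 4/7             | 2 NVDECs / 0 JPEG / 0 OFA   | 4/8           | 1
MIG 7g.80gb    | Full               | 7/7             | 5 NVDECs / 1 JPEG / 1 OFA   | Full          | 1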
Table. NVIDIA A100 MIG Profile
Note
The MIG 1g.10gb+me profile is available only starting with the R470 driver.
  2. Create a MIG GPU Instance, then verify it.
    • Create a GPU Instance

      $ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]
      Code block. nvidia-smi command - creating a GPU Instance
      $ nvidia-smi mig -i 0 -cgi 0
      Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)
      Code block. nvidia-smi command - GPU Instance creation example

    • Verify the GPU Instance

      $ nvidia-smi mig -i [GPU ID] -lgi
      Code block. nvidia-smi command - listing GPU Instances
      $ nvidia-smi mig -i 0 -lgi
      +--------------------------------------------------------+
      | GPU instances:                                         |
      | GPU   Name               Profile  Instance  Placement  |
      |                            ID       ID      Start:Size |
      |========================================================|
      |   0  MIG 7g.80gb            0        0         0:8     |
      +--------------------------------------------------------+
      Code block. nvidia-smi command - GPU Instance listing example
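
Choosing the `[Profile ID]` for `-cgi` means reading it from the profile listing shown earlier in this section. The following minimal sketch extracts the name, profile ID, and free/total counts from that table, assuming the row layout matches the sample output in this guide. The regular expression and function name are hypothetical, and results are keyed by profile name, so it suits a single-GPU listing.

```python
import re

# Matches rows such as: |   0 MIG 1g.10gb        19    7/7   ...
PROFILE_ROW = re.compile(r"^\|\s*(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)\s")


def parse_gi_profiles(listing):
    """Return {profile name: {'gpu', 'id', 'free', 'total'}} from -lgip output."""
    profiles = {}
    for line in listing.splitlines():
        match = PROFILE_ROW.match(line.strip())
        if not match:
            continue  # borders, headers, and continuation rows
        gpu, name, profile_id, free, total = match.groups()
        profiles[name] = {
            "gpu": int(gpu),
            "id": int(profile_id),
            "free": int(free),
            "total": int(total),
        }
    return profiles


if __name__ == "__main__":
    sample = (
        "|   0 MIG 3g.40gb         9    2/2         39.50      No     42     2     0   |\n"
        "|                                                             3     0     0   |\n"
    )
    print(parse_gi_profiles(sample))
```

Checking the `free` count before calling `-cgi` helps avoid requesting a profile that has no remaining slots.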

Creating a Compute Instance

Once you have created a GPU Instance, you can create a Compute Instance.

  1. Check the MIG Compute Instance profiles that can be created.

    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
    Code block. nvidia-smi command - listing MIG Compute Instance profiles
    $ nvidia-smi mig -i 0 -gi 0 -lcip
    +---------------------------------------------------------------------------------+
    | Compute instance profiles:                                                      |
    | GPU     GPU     Name            Profile  Instances   Exclusive      Shared      |
    | GPU   Instance                     ID    Free/Total     SM       DEC  ENC  OFA  |
    |         ID                                                       CE   JPEG      |
    |=================================================================================|
    |   0      0      MIG 1c.7g.80gb     0      7/7           14       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 2c.7g.80gb     1      3/3           28       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 3c.7g.80gb     2      2/2           42       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 4c.7g.80gb     3      1/1           56       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 7g.80gb        4*     1/1           98       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    Code block. MIG Compute Instance profile list example

  2. Create a MIG Compute Instance, then verify it.

    • Create a MIG Compute Instance
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]
      Code block. nvidia-smi command - creating a MIG Compute Instance
      $ nvidia-smi mig -i 0 -gi 0 -cci 4
      Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb(ID 4)
      Code block. nvidia-smi command - MIG Compute Instance creation example
    • Verify the MIG Compute Instance
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci
      Code block. nvidia-smi command - listing MIG Compute Instances
      $ nvidia-smi mig -i 0 -gi 0 -lci
      +-----------------------------------------------------------------+
      | Compute instance profiles:                                      |
      | GPU     GPU     Name            Profile  Instances   Placement  |
      | GPU   Instance                     ID      ID        Start:Size |
      |         ID                                                      |
      |=================================================================|
      |   0      0      MIG 7g.80gb         4       0            0:7    |
      +-----------------------------------------------------------------+
      Code block. MIG Compute Instance listing example
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
        MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
      Code block. nvidia-smi command - checking the GPU state (1)
      $ nvidia-smi
      Mon Sep 27 09:52:17 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C   P0    49W / 400W  |      0MiB / 81251MiB |     N/A      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
      |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                   |                     |        ECC|                       |
      |=============================================================================|
      |   0    0   0    0 |     0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                   |     1MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - checking the GPU state (2)
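
Once a Compute Instance exists, `nvidia-smi -L` lists each MIG device with its own UUID, as in the sample output above. The following minimal sketch collects those entries, assuming the line layout matches that sample. The pattern and function name are hypothetical.

```python
import re

# Matches lines such as:
#   MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-...)
MIG_DEVICE_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(\S+)\)"
)


def parse_mig_devices(listing):
    """Return a list of MIG devices found in `nvidia-smi -L` output."""
    devices = []
    for line in listing.splitlines():
        match = MIG_DEVICE_LINE.match(line)
        if match:
            profile, device_index, uuid = match.groups()
            devices.append(
                {"profile": profile, "device": int(device_index), "uuid": uuid}
            )
    return devices


if __name__ == "__main__":
    sample = (
        "GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)\n"
        "  MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)\n"
    )
    for device in parse_mig_devices(sample):
        print(device["profile"], device["device"], device["uuid"])
```

An empty result while MIG mode is Enabled suggests that no Compute Instance has been created yet.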

Using MIG

  1. Run a Job using the MIG Instance.
    • Job execution example
      $ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi
      Code block. Job execution example
    • The following shows an example of a Job that has been run.
      $ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh
      
      ================
      == TensorFlow ==
      ================
      
      NVIDIA Release 21.08-tf1 (build 26012104)
      TensorFlow Version 1.15.5
      
      Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
      ...
      
      # Run the Python process
      root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100 
      ...
      PY 3.8.10 (default, Jun 2 2021, 10:49:15)
      [GCC 9.4.0]
      TF 1.15.5
      ...
      Code block. Job execution result
  2. Check GPU utilization. (Job process created)
    • While the Job is running, you can see that a process is assigned to the MIG device and that its usage increases.
      $ nvidia-smi
      Code block. nvidia-smi command - checking GPU utilization
    • GPU utilization can be checked as shown below.
      Code block. Example of checking GPU utilization
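
For scripting, the Processes table in the output above can be reduced to one line per process. The following is a minimal sketch and not part of the original procedure: `mig_processes` is a hypothetical helper name, and the parsing assumes the table layout of the 470-series driver output shown above (other driver versions may format the table differently) and process names without spaces.

```shell
#!/usr/bin/env bash
# Sketch: list the processes that nvidia-smi reports on MIG devices.
# mig_processes is a hypothetical helper; it reads nvidia-smi output on stdin
# and prints "GPU GI CI PID NAME MEMORY" for each row of the Processes table.
mig_processes() {
  awk '
    # Start collecting once the "Processes:" table header is seen.
    /^[ \t]*\| Processes:/ { in_procs = 1; next }
    # Data rows begin with "|" followed by a numeric GPU index.
    in_procs && /^[ \t]*\|[ \t]+[0-9]+[ \t]/ {
      gsub(/\|/, "")                  # drop the table borders
      print $1, $2, $3, $4, $6, $7    # skip $5, the process type column
    }
  '
}
```

For example, `nvidia-smi | mig_processes` would print `0 0 0 17483 python 66559MiB` for the output shown above. A list like this can help confirm that no job is still running on a MIG device before the instances are deleted.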

Deleting MIG Instances and Disabling MIG

To delete the MIG Instances and disable MIG, follow the procedure below.

MIG release order
Delete Compute Instance → Delete GPU Instance → Disable MIG

Deleting a Compute Instance

  • Delete the Compute Instance.
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance ID] -dci
    Code block. nvidia-smi command - Delete Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -lci
    +-----------------------------------------------------------------+
    | Compute instance profiles:                                      |
    | GPU     GPU     Name            Profile  Instances   Placement  |
    | GPU   Instance                     ID      ID        Start:Size |
    |         ID                                                      |
    |=================================================================|
    |   0      0      MIG 7g.80gb         4       0            0:7    |
    +-----------------------------------------------------------------+
    Code block. Example of checking MIG Compute Instances
    $ nvidia-smi mig -i 0 -gi 0 -dci
    Successfully destroyed compute instance ID  0 from GPU instance ID  0
    Code block. Example of deleting a Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -lci
    No compute instances found: Not found
    Code block. Confirming Compute Instance deletion

Deleting a GPU Instance

  • Delete the GPU Instance.
    $ nvidia-smi mig -i [GPU ID] -dgi
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi
    Code block. nvidia-smi command - Delete GPU Instance
    $ nvidia-smi mig -i 0 -lgi
    +--------------------------------------------------------+
    | GPU instances:                                         |
    | GPU   Name               Profile  Instance  Placement  |
    |                            ID       ID      Start:Size |
    |========================================================|
    |   0  MIG 7g.80gb            0        0         0:8     |
    +--------------------------------------------------------+
    Code block. nvidia-smi command - Example of checking GPU Instances
    $ nvidia-smi mig -i 0 -dgi
    Successfully destroyed GPU instance ID  0 from GPU  0
    Code block. nvidia-smi command - Example of deleting a GPU Instance
    $ nvidia-smi mig -i 0 -lgi
    No GPU instances found: Not found
    Code block. nvidia-smi command - Confirming GPU Instance deletion

Disabling MIG

  • Disable MIG, then reboot.
    $ nvidia-smi -mig 0
    Disabled MIG Mode for GPU 00000000:05:00.0
    
    All done.
    Code block. nvidia-smi command - Disable MIG
    $ nvidia-smi
    Mon Sep 30 05:18:28 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------|
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
    | N/A   33C   P0    60W / 400W  |      0MiB / 81251MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | MIG devices:                                                                |
    +-----------------------------------------------------------------------------+
    |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
    |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                   |                     |        ECC|                       |
    |=============================================================================|
    | No MIG devices found                                                        |
    +-----------------------------------------------------------------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    | No running processes found                                                  |
    +-----------------------------------------------------------------------------+
    Code block. nvidia-smi command - Check GPU status
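
The release order in this section (Compute Instances, then GPU Instances, then MIG mode) can be sketched as a single shell function. This is a minimal sketch under stated assumptions, not an official procedure: `mig_release` and the `NVIDIA_SMI` override variable are hypothetical names introduced for illustration, and omitting `-gi` and `-ci` is assumed to target every instance on the given GPU. The function does not reboot the VM.

```shell
#!/usr/bin/env bash
# Sketch: release MIG on one GPU in the documented order.
# NVIDIA_SMI is a hypothetical override so the sequence can be dry-run
# with a stub command instead of the real nvidia-smi binary.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# mig_release GPU_ID (hypothetical helper)
# Each step runs only if the previous one succeeded.
mig_release() {
  local gpu_id="$1"
  "$NVIDIA_SMI" mig -i "$gpu_id" -dci &&   # 1. delete Compute Instances
  "$NVIDIA_SMI" mig -i "$gpu_id" -dgi &&   # 2. delete GPU Instances
  "$NVIDIA_SMI" -i "$gpu_id" -mig 0        # 3. disable MIG mode
}
```

Calling `mig_release 0` would run the three steps for GPU 0 and stop at the first failure, for example when a process is still using a Compute Instance. Reboot afterward as described above, then run `nvidia-smi` to confirm that MIG mode is reported as Disabled.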