The page has been translated by Gen AI.

Using NVSwitch on GPU Server

After creating a GPU Server, you can enable the NVSwitch feature in the GPU Server's VM (Guest OS) to get fast P2P (GPU-to-GPU) communication between GPUs.

Exploring NVIDIA NVSwitch for Multi GPU

The NVIDIA A100 GPU server is a multi-GPU system based on the NVIDIA Ampere architecture, with eight 80 GB A100 GPUs installed on the baseboard. The GPUs on the baseboard are connected to six NVSwitches through NVLink ports, so any GPU can communicate with any other GPU on the baseboard at the full 600 GBps bandwidth. As a result, the eight GPUs in the A100 GPU server can be connected and operated like a single GPU, maximizing GPU-to-GPU throughput.

  • NVLink (25 GBps) 12 lanes, 8 GPU configuration
Figure. NVLink (25 GBps) 12 lanes 8 GPU configuration diagram
  • NVSwitch (600 GBps) 6 units, 8 GPU configuration
Figure. NVSwitch (600 GBps) 6 units 8 GPU configuration diagram
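
The 600 GBps figure can be related to the lane count with a little arithmetic. The sketch below assumes that 25 GBps is the one-way rate of a single NVLink lane and that 600 GBps counts both directions per GPU; the page does not spell this out, so treat the variable names and the interpretation as illustrative.

```shell
# Assumption: 25 GBps is the one-way rate of a single NVLink lane,
# and the 600 GBps figure counts both directions for one GPU.
lanes_per_gpu=12
gbps_per_lane_one_way=25
directions=2

total_gbps=$((lanes_per_gpu * gbps_per_lane_one_way * directions))
echo "Per-GPU NVLink bandwidth: ${total_gbps} GBps"
```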

Create GPU NVSwitch

To use the GPU NVSwitch feature, create a GPU Server service on the Samsung Cloud Platform, create a VM instance (Guest OS) with 8 A100 GPUs assigned, and activate Fabric Manager.

Caution
  • NVSwitch can be activated and used only on the product that assigns 8 A100 GPUs to a single GPU server (g1v128a8 (vCPU 128 | Memory 1920G | A100(80GB)*8)).
  • GPU Servers created with Windows OS do not currently support NVSwitch (Fabric Manager).
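
As a quick way to confirm the first condition from inside the VM, the following sketch counts the A100 GPUs that the driver reports. It assumes the NVIDIA driver from the procedure below is already installed (otherwise `nvidia-smi` is not available) and that the GPU names contain the string "A100". The function names `count_a100_gpus` and `has_eight_a100` are hypothetical helpers, not part of any NVIDIA tool.

```shell
# Count A100 GPUs from a list of GPU names, one name per line on stdin.
# Assumption: every A100 name reported by the driver contains "A100".
count_a100_gpus() {
    grep -c 'A100' || true   # grep exits 1 on zero matches; still prints 0
}

# Succeed only when exactly eight A100 GPUs are listed on stdin.
has_eight_a100() {
    [ "$(count_a100_gpus)" -eq 8 ]
}

# Run the check only where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=name --format=csv,noheader | has_eight_a100; then
        echo "8 A100 GPUs visible: NVSwitch can be enabled"
    else
        echo "Did not find exactly 8 A100 GPUs" >&2
    fi
fi
```

Reading the names from stdin keeps the helpers usable on saved output as well as on a live server.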

NVSwitch Installation and Operation Check (Fabric Manager Activation)

To operate NVSwitch, install Fabric Manager on the GPU instance by following the procedure below.

  1. Install the NVIDIA GPU driver (version 470.52.02) on the GPU server.

    $ add-apt-repository ppa:graphics-drivers/ppa
    $ apt-get update
    $ apt-get install nvidia-driver-470-server
    Code Block. NVIDIA GPU Driver Installation

  2. Install and start NVIDIA Fabric Manager (version 470), which is required for NVSwitch, on the GPU server.

    $ apt-get install cuda-drivers-fabricmanager-470
    $ systemctl enable nvidia-fabricmanager
    $ systemctl start nvidia-fabricmanager
    Code Block. NVIDIA Fabric Manager Installation and Operation

  3. Check the status of NVIDIA Fabric Manager running on the GPU server.

    • In normal operation, the status is shown as active (running).
      $ systemctl status nvidia-fabricmanager
      Code Block. Check NVIDIA Fabric Manager Operation Status
Figure. NVSwitch installation - Checking the operation status of Fabric Manager
  4. Check the NVSwitch operation status on the GPU server.
    • In normal operation, every GPU-to-GPU connection is shown as NV12.
      $ nvidia-smi topo --matrix
      Code Block. NVSwitch operation status check
Figure. NVSwitch Installation - Checking NVSwitch Operation Status
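
The two checks in the procedure above can also be scripted. The following is a minimal sketch, not an official tool: `fabricmanager_is_active` and `topology_is_all_nv12` are hypothetical helper names, and the parsing assumes that the topology matrix prints one row per GPU starting with `GPU<n>`, with `X` on the diagonal and the GPU columns listed first.

```shell
# Succeed when the Fabric Manager service is reported as active.
fabricmanager_is_active() {
    [ "$(systemctl is-active nvidia-fabricmanager 2>/dev/null)" = "active" ]
}

# Read `nvidia-smi topo --matrix` output on stdin and succeed only when
# every GPU-to-GPU cell is NV12. Exit status: 0 = all NV12,
# 1 = some other link type found, 2 = no GPU rows found.
topology_is_all_nv12() {
    awk '
        # Keep data rows only: first field is GPU<n> and the row has an X
        # (the self cell). The header row has no X, so it is skipped.
        $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "X") { rows[++n] = $0; break }
            }
        }
        END {
            if (n == 0) exit 2
            bad = 0
            for (r = 1; r <= n; r++) {
                split(rows[r], f)
                # Columns 2..n+1 are the GPU columns; later ones
                # (affinity and so on) are ignored.
                for (c = 2; c <= n + 1; c++) {
                    if (f[c] != "X" && f[c] != "NV12") bad++
                }
            }
            exit (bad > 0)
        }
    '
}

# Run the combined check only where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if fabricmanager_is_active && nvidia-smi topo --matrix | topology_is_all_nv12; then
        echo "NVSwitch check passed: Fabric Manager active, all GPU pairs NV12"
    else
        echo "NVSwitch check failed: review steps 3 and 4" >&2
    fi
fi
```

If the check fails, running the commands from steps 3 and 4 by hand shows which of the two conditions is not met.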