How-to guides

Through the Samsung Cloud Platform Console, you can enter the required information for a GPU Server, select detailed options, and create the service.

Create GPU Server

You can create and use GPU Server services from the Samsung Cloud Platform Console.

If you want to create a GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the Create GPU Server button to go to the Create GPU Server page.
  3. On the Create GPU Server page, enter the information required to create the service and select detailed options.
    • In the Image and version selection area, select the required information.
      Image (Required): Select the provided image type
      • Ubuntu
      Image version (Required): Select the version of the chosen image
      • Provides a list of the server image versions offered
      Table. GPU Server image and version selection input items
    • In the Service Information Input area, enter or select the required information.
      Server count (Required): Number of GPU Servers to create simultaneously
      • Only numbers can be entered; input a value between 1 and 100
      Service Type > Server Type (Required): GPU Server server type
      • Indicates server specifications by GPU type; select a server that includes 1, 2, 4, or 8 GPUs
      Service Type > Planned Compute (Optional): Resource status with Planned Compute set
      • In Use: Number of resources with Planned Compute set that are currently in use
      • Configured: Number of resources with Planned Compute set
      • Coverage Preview: Amount applied per resource by Planned Compute
      • Planned Compute Service Application: Go to the Planned Compute service application page
      Block Storage (Required): Set the Block Storage used by the GPU Server according to its purpose
      • Basic: Area where the OS is installed and used
        • Capacity is entered in Units (the minimum capacity varies depending on the OS image type)
          • RHEL: values between 3 and 1,536 can be entered
          • Ubuntu: values between 3 and 1,536 can be entered
        • SSD: high-performance general volume
        • HDD: general volume
        • SSD/HDD_KMS: encrypted volume using a Samsung Cloud Platform KMS (Key Management System) encryption key
          • Encryption can only be applied at initial creation (it cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
      • Additional: Used when additional user space is needed outside the OS area
        • After selecting Use, enter the storage type and capacity
        • To add storage, click the + button (up to 25 can be added); to delete, click the x button
        • Capacity is entered in Units; values between 1 and 1,536 can be entered
          • Since 1 Unit equals 8 GB, 8 to 12,288 GB can be created
        • SSD: high-performance general volume
        • HDD: general volume
        • SSD/HDD_KMS: encrypted volume using a Samsung Cloud Platform KMS (Key Management System) encryption key
          • Encryption can only be applied at initial creation (it cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
      • Delete on termination: If Delete on termination is set to Enabled, the volume is terminated together when the server is terminated
        • Volumes with snapshots are not deleted even if Delete on termination is set to Enabled
        • A multi-attach volume is deleted only when the server being terminated is the last remaining server attached to the volume
      Table. GPU Server Service Configuration Items
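      As a quick sanity check on the capacity figures above, the Unit-to-GB conversion (1 Unit = 8 GB) can be sketched in shell; the helper function name is illustrative, not part of the console.

      ```shell
      # Illustrative helper: convert Block Storage capacity Units to GB.
      # 1 Unit = 8 GB, so the additional-volume range of 1 to 1,536 Units
      # corresponds to 8 to 12,288 GB.
      units_to_gb() {
        echo $(( $1 * 8 ))
      }

      units_to_gb 1     # minimum additional volume: 8 GB
      units_to_gb 1536  # maximum additional volume: 12288 GB
      ```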
    • In the Required Information Input area, enter or select the required information.
      Server Name (Required): Enter a name to distinguish the server when the selected server count is 1
      • The hostname is set to the entered server name
      • Enter within 63 characters using English letters, numbers, spaces, and the special characters (-, _)
      Server Name Prefix (Required): Enter a prefix to distinguish each server generated when the selected server count is 2 or more
      • Server names are automatically generated in the format of the entered prefix + ‘-#’
      • Enter within 59 characters using English letters, numbers, spaces, and the special characters (-, _)
      Network Settings > Create New Network Port (Required): Set the network where the GPU Server will be installed
      • Select a pre-created VPC
      • General Subnet: Select a pre-created General Subnet
        • The IP can be set to auto-generate or user input; if input is selected, the user can directly enter the IP
        • NAT: Can be used only if there is one server and the VPC has an Internet Gateway attached; checking Use allows selection of a NAT IP
        • NAT IP: Select a NAT IP
          • If there is no NAT IP to select, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
          • Creating a Public IP incurs charges according to the Public IP pricing policy
      • Local Subnet (Optional): Select whether to use a Local Subnet
        • Not a required element for creating the service
        • A pre-created Local Subnet must be selected
        • The IP can be set to auto-generate or user input; if input is selected, the user can directly enter the IP
        • Security Group: Settings required to access the server
          • Select: Choose a pre-created Security Group
          • Create New: If there is no applicable Security Group, it can be created separately in the Security Group service
          • Up to 5 can be selected
          • If no Security Group is set, all access is blocked by default
          • A Security Group must be set to allow the required access
      Network Settings > Existing Network Port Specification (Required): Set the network where the GPU Server will be installed
      • Select a pre-created VPC
      • General Subnet: Select a pre-created General Subnet and Port
        • NAT: Can be used only if there is one server and the VPC has an Internet Gateway attached; checking Use allows selection of a NAT IP
        • NAT IP: Select a NAT IP
          • If there is no NAT IP to select, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
      • Local Subnet (Optional): Select whether to use a Local Subnet
        • Select a pre-created Local Subnet and Port
      Keypair (Required): User authentication method to use when connecting to the server
      • Create New: If a new Keypair is needed, create a new one
      • List of default login accounts by OS
        • RHEL: cloud-user
        • Ubuntu: ubuntu
      Table. GPU Server required information input items
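      Once the server is Active, you connect with the Keypair's private key and the default login account for the image type. A minimal sketch, assuming a hypothetical key path and NAT IP; the helper function is illustrative, not part of the platform:

      ```shell
      # Pick the default login account for an OS image type, per the table above.
      default_account() {
        case "$1" in
          RHEL)   echo cloud-user ;;
          Ubuntu) echo ubuntu ;;
          *)      return 1 ;;
        esac
      }

      # Hypothetical connection example; the key path and <NAT-IP> are placeholders.
      # ssh -i ~/keys/my-keypair.pem "$(default_account Ubuntu)"@<NAT-IP>
      ```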
    • In the Additional Information Input area, enter or select the required information.
      Lock (Optional): Set whether to use Lock
      • Using Lock prevents actions such as server termination, start, and stop from being executed, preventing malfunctions caused by mistakes
      Init script (Optional): Script executed when the server starts
      • Depending on the image type, write the Init script as a Batch script for Windows, or as a shell script or cloud-init for Linux
      • Up to 45,000 bytes can be entered
      Tag (Optional): Add tags
      • Up to 50 can be added per resource
      • After clicking the Add Tag button, enter or select Key and Value values
      Table. GPU Server Additional Information Input Items
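      For the Linux images offered here, the Init script can be a plain shell script (or cloud-init user data). A minimal illustrative sketch; the log path is an example, not a fixed platform location:

      ```shell
      #!/bin/bash
      # Illustrative Init script for a Linux image. It records the first-boot
      # time to a log file; /tmp/first-boot.log is an example path only.
      LOG=/tmp/first-boot.log
      echo "server initialized at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$LOG"
      ```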
  4. Check the detailed information and estimated billing amount in the Summary panel, and click the Complete button.
    • When creation is complete, check the created resources on the GPU Server List page.

Check GPU Server detailed information

You can view and edit the full resource list and detailed information of the GPU Server service. The GPU Server Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Server service, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource whose detailed information you want to view. You will be taken to the GPU Server Details page.
    • The GPU Server Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
    • For detailed information on GPU Server additional features, refer to GPU Server Management Additional Features.
      GPU Server status: Status of the GPU Server created by the user
      • Build: State where the Build command has been delivered
      • Building: Build in progress
      • Networking: Network configuration in progress during server creation
      • Scheduling: Host scheduling in progress during server creation
      • Block_Device_Mapping: Connecting Block Storage during server creation
      • Spawning: State where the server creation process is ongoing
      • Active: Usable state
      • Powering_off: State when a stop request has been made
      • Deleting: Server deletion in progress
      • Reboot_Started: Reboot in progress
      • Error: Error state
      • Migrating: State where the server is migrating to another host
      • Reboot: State where the Reboot command has been delivered
      • Rebooting: Restart in progress
      • Rebuild: State where the Rebuild command has been delivered
      • Rebuilding: State when a Rebuild has been requested
      • Rebuild_Spawning: State where the Rebuild process is ongoing
      • Resize: State where the Resize command has been delivered
      • Resizing: Resize in progress
      • Resize_Prep: State when a server type modification has been requested
      • Resize_Migrating: State where the server is moving to another host while the Resize is in progress
      • Resize_Migrated: State where the server has completed moving to another host while the Resize is in progress
      • Resize_Finish: Resize completed
      • Revert_Resize: The Resize or migration of the server failed for some reason; the target server is cleaned up and the original server restarts
      • Shutoff: State when powering off is completed
      • Verify_Resize: State after Resize_Prep where the new server type is confirmed, or can be chosen to be reverted
      • Resize_Reverting: State when a server type revert has been requested
      • Resize_Confirming: State where the server’s Resize request is being confirmed
      Server Control: Buttons to change the server status
      • Start: Start a stopped server
      • Stop: Stop a running server
      • Restart: Restart a running server
      Image Generation: Create a user custom image from the current server’s image
      Console Log: View the current server’s console log
      • You can check the console log output from the current server. For more details, refer to Check console log.
      Dump creation: Create a dump of the current server
      • The dump file is created inside the GPU Server
      • For the detailed dump creation method, refer to Create Dump
      Rebuild: All data and settings of the existing server are deleted, and a new server is set up
      Service Cancellation: Button to cancel the service
      Table. GPU Server status information and additional features

Detailed Information

On the GPU Server List page, you can view the detailed information of the selected resource and, if necessary, edit the information.

Service: Service name
Resource Type: Resource type
SRN: Unique resource ID in Samsung Cloud Platform
  • In the GPU Server service, this means the GPU Server SRN
Resource Name: Resource name
  • In the GPU Server service, this refers to the GPU Server name
Resource ID: Unique resource ID in the service
Creator: User who created the service
Creation time: Service creation time
Editor: User who edited the service information
Modification date: Date the service information was modified
Server name: Server name
Server Type: Displays vCPU, Memory, and GPU information
  • If you need to change to a different server type, click the Edit button to set it
Image Name: Service OS image and version
Lock: Displays the Lock usage status
  • If you need to change the Lock attribute value, click the Edit button to set it
Server Group: Name of the server group the server belongs to
Keypair name: Server authentication information set by the user
Planned Compute: Resource status with Planned Compute set
Network: Network information of the GPU Server
  • VPC, General Subnet, IP, NAT IP, NAT IP status, Security Group
  • If you need to change the NAT IP value, click the Edit button to set it
  • If you need to change the Security Group, click the Edit button to set it
  • Add as new network: Select a General Subnet and IP
    • You can select another General Subnet within the same VPC
    • The IP can be set to auto-generate or user input; if input is selected, the user can directly enter the IP
  • Add with existing port: Select a pre-created General Subnet and port
Local Subnet: The GPU Server’s Local Subnet information
  • Local Subnet, Local Subnet IP, Security Group
  • If a Security Group change is needed, click the Edit button to set it
  • Add as new network: Select a Local Subnet and IP
    • You can select a different Local Subnet within the same VPC
    • The IP can be auto-generated or user input; if input is selected, the user enters the IP directly
  • Add with existing port: Select a pre-created Local Subnet and port
Block Storage: Information on the Block Storage connected to the server
  • Volume ID, Volume Name, Type, Capacity, Connection Info, Category, Delete on termination, Status
  • Add: Additional Block Storage can be connected if needed
  • Modify Delete on termination: Modify the Delete on termination value
  • Disconnect: Disconnect additionally connected Block Storage
Table. GPU Server detailed information tab items

Tags

On the GPU Server List page, you can view the tag information of the selected resource and add, modify, or delete tags.

Tag List: List of tags
  • The Key and Value information of each tag can be checked
  • Up to 50 tags can be added per resource
  • When entering tags, you can search and select from existing Key and Value lists
Table. GPU Server Tag tab items

Job History

You can view the job history of the selected resource on the GPU Server List page.

Job History List: Resource change history
  • Work date and time, Resource ID, Resource name, Work details, Event topic, Work result, and worker information can be checked
Table. Job History tab detailed information items

GPU Server Operation Control

If you need to control the operation of created GPU Server resources, you can perform the task on the GPU Server List or GPU Server Details page. You can start, stop, and restart servers.

GPU Server Start

You can start a shutoff GPU Server. To start the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource to start among the shutoff servers, and go to the GPU Server Details page.
    • On the GPU Server List page, you can also start each resource via the More button on the right.
    • After selecting multiple servers with the checkbox, you can control multiple servers simultaneously through the Start button at the top.
  4. On the GPU Server Details page, click the Start button at the top to start the server. Check the changed server status in the status display item.
    • When the GPU Server start is completed, the server status changes from Shutoff to Active.
    • For detailed information about GPU Server statuses, refer to Check GPU Server detailed information.

GPU Server Stop

You can stop an active GPU Server. To stop the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource to stop among the active servers, and go to the GPU Server Details page.
    • On the GPU Server List page, you can also stop each resource via the More button on the right.
    • After selecting multiple servers with the checkbox, you can control multiple servers simultaneously through the Stop button at the top.
  4. On the GPU Server Details page, click the Stop button at the top to stop the server. Check the changed server status in the status display item.
    • When the GPU Server shutdown is completed, the server status changes from Active to Shutoff.
    • For detailed information about GPU Server statuses, refer to Check GPU Server detailed information.

GPU Server Restart

You can restart a created GPU Server. To restart the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource to restart, and go to the GPU Server Details page.
    • On the GPU Server List page, you can also restart each resource via the More button on the right.
    • After selecting multiple servers with the checkbox, you can control multiple servers simultaneously through the Restart button at the top.
  4. On the GPU Server Details page, click the Restart button at the top to restart the server. Check the changed server status in the status display item.
    • During the GPU Server restart, the server status goes through Rebooting and finally changes to Active.
    • For detailed information about GPU Server statuses, refer to Check GPU Server detailed information.

GPU Server Resource Management

If you need server control and management functions for created GPU Server resources, you can perform the work on the GPU Server List or GPU Server Details page.

Create Image

You can create an image of a running GPU Server.

Reference

This section explains how to create a user custom image from the image of a running GPU Server.

  • On the GPU Server List or GPU Server Details page, click the Image Generation button to create a user custom image.

To create an Image of the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.

  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.

  3. On the GPU Server List page, click the resource to create an image from. You will be taken to the GPU Server Details page.

  4. On the GPU Server Details page, click the Image Generation button to go to the Image Generation page.

    • In the Service Information Input area, enter the required information.
      Image Name (Required): Name of the image to be generated
      • Enter within 200 characters using English letters, numbers, spaces, and the special characters (-, _)
      Table. Image service information input items
  5. Check the input information and click the Complete button.

    • When creation is complete, check the created resource on the All Services > Compute > GPU Server > Image List page.
Notice
  • When you create an image, the generated image is stored in the Object Storage used as internal storage, so Object Storage usage fees are charged.
  • The file system integrity of an image created from a GPU Server in the Active state cannot be guaranteed, so creating the image after shutting down the server is recommended.

Enable ServiceWatch Detailed Monitoring

By default, GPU Server is linked with ServiceWatch basic monitoring under the Virtual Server namespace. You can enable detailed monitoring as needed to identify and address operational issues more quickly. For more information about ServiceWatch, see the ServiceWatch Overview (/userguide/management/service_watch/overview/).

Reference
GPU Server provides basic and detailed monitoring in the same namespace as Virtual Server. GPU metrics for GPU Server are scheduled to be provided through the ServiceWatch Agent (scheduled for Dec 2025).
Caution
Basic monitoring is provided for free, but activating detailed monitoring incurs additional charges. Please keep this in mind when using it.

To enable detailed monitoring of ServiceWatch on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource on which to enable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup, select Enable, check the guidance text, and click the Confirm button.
  6. On the GPU Server Details page, check the ServiceWatch detailed monitoring items.

Disable ServiceWatch Detailed Monitoring

Caution
For cost efficiency, keep detailed monitoring enabled only when absolutely necessary, and disable it otherwise.

To disable the detailed monitoring of ServiceWatch on the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource on which to disable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup, deselect Enable, check the guidance text, and click the Confirm button.
  6. On the GPU Server Details page, check the ServiceWatch detailed monitoring items.

GPU Server Management Additional Features

For GPU Server management, you can view console logs, create dumps, and perform a Rebuild. Follow the steps below for each task.

Check console log

You can view the current console log of the GPU Server.

To check the console logs of the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource whose console log you want to view, and go to the GPU Server Details page.
  4. On the GPU Server Details page, click the Console Log button to open the Console Log popup window.
  5. Check the console log displayed in the Console Log popup window.

Create Dump

To create a Dump file of the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource for which to create a dump, and go to the GPU Server Details page.
  4. On the GPU Server Details page, click the Create Dump button.
    • The dump file is created inside the GPU Server.

Perform Rebuild

You can delete all data and settings of the existing GPU Server and rebuild it on a new server.

To perform the Rebuild of the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, click the resource on which to perform the Rebuild, and go to the GPU Server Details page.
  4. On the GPU Server Details page, click the Rebuild button.
    • During the GPU Server Rebuild, the server status changes to Rebuilding, and when the Rebuild is completed, the status returns to what it was before the Rebuild.
    • For detailed information about GPU Server statuses, refer to Check GPU Server detailed information.

Cancel GPU Server

Canceling an unused GPU Server can reduce operating costs. However, canceling a GPU Server may immediately stop the service currently running on it, so proceed with the cancellation only after fully considering the impact of the service interruption.

Caution
Please note that data cannot be recovered after service termination.

To cancel the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the GPU Server menu to go to the GPU Server List page.
  3. On the GPU Server List page, select the resource to cancel, and click the Cancel Service button.
    • The termination of connected storage depends on the Delete on termination setting; refer to Termination Constraints.
  4. When termination is completed, check on the GPU Server List page that the resource has been terminated.

Termination Constraints

If a termination request for a GPU Server cannot be processed, a popup window explains why. Refer to the cases below.

Cancellation not allowed
  • If File Storage is connected, disconnect the File Storage first.
  • If an LB Pool is connected, disconnect the LB Pool first.
  • If Lock is set, change the Lock setting to unused and try again.

The termination of attached storage depends on the Delete on termination setting.

Deletion according to the Delete on termination setting
  • Whether the volume is also deleted depends on the Delete on termination setting.
    • If Delete on termination is not set: Even if you terminate the GPU Server, the volume is not deleted.
    • If Delete on termination is set: If you terminate the GPU Server, the volume is deleted.
  • Volumes with a snapshot are not deleted even if Delete on termination is set.
  • A multi-attach volume is deleted only when the server being terminated is the last remaining server attached to the volume.
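
The deletion rule above can be sketched as a small decision helper; the function and its inputs are illustrative, not part of the platform:

```shell
# Illustrative decision helper for the Delete on termination rule above.
# Arguments: delete_on_termination, has_snapshot, is_last_attached_server
# (each "true" or "false"). Prints "deleted" or "kept".
volume_fate() {
  if [ "$1" = true ] && [ "$2" = false ] && [ "$3" = true ]; then
    echo deleted
  else
    echo kept
  fi
}

volume_fate true  false true   # flag set, no snapshot, last attached server
volume_fate true  true  true   # snapshot present: volume is kept
volume_fate false false true   # flag not set: volume is kept
```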

1 - Image Management

Through the Samsung Cloud Platform Console, you can enter the required information for the Image service within the GPU Server service, select detailed options, and create the service.

Image Generation

You can create an image of a running GPU Server. To create an image of a GPU Server, refer to Create Image.

Check Image detailed information

The Image service allows you to view and edit the full resource list and detailed information. The Image Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the Image service, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu to go to the GPU Server Service Home page.
  2. On the Service Home page, click the Image menu to go to the Image List page.
  3. On the Image List page, click the resource whose detailed information you want to view. You will be taken to the Image Details page.
    • The Image Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
      Image Status: Status of the user-generated image
      • Active: Available state
      • Queued: The image has been uploaded and is waiting for processing
      • Importing: The image has been uploaded and is being processed
      Share to another Account: The image can be shared with another Account
      • The image’s Visibility must be Shared to share it with another Account
      Delete Image: Button to delete the image
      • If the image is deleted, it cannot be recovered
      Table. GPU Server Image status information and additional features

Detailed Information

The Image List page allows you to view detailed information of the selected resource and edit the information if necessary.

Service: Service name
Resource Type: Resource type
SRN: Unique resource ID in Samsung Cloud Platform
  • Means the SRN of a GPU Server Image
Resource Name: Image name
Resource ID: Image ID
Creator: User who created the image
Creation date/time: Date and time the image was created
Editor: User who edited the image
Edit date/time: Date and time the image was edited
Image name: Image name
Minimum Disk: The image’s minimum disk capacity (GB)
  • If you need to modify the minimum disk, click the Edit button to set it
Minimum RAM: The image’s minimum RAM capacity (GB)
OS type: The image’s OS type
OS hash algorithm: OS hash algorithm method
Visibility: Displays access permissions for the image
  • Private can be used only within the project, and Shared can be shared across projects
Protected: Select whether image deletion is prohibited
  • If checked, accidental deletion of the image can be prevented
  • This setting can be changed after image creation
Image File URL: Image file URL uploaded when generating the image
  • Not displayed for images created via the Image Generation menu on the GPU Server Details page
Sharing Status: Status of sharing the image with other Accounts
  • Approved Account ID: ID of the Account that has been approved for sharing
  • Modification Date/Time: The date and time sharing was requested from another Account; if the sharing status later changes from Pending to Accepted, it is updated to that date and time
  • Status: Approval status
    • Accepted: Approved and being shared
    • Pending: Waiting for approval
    • Delete: Sharing has been stopped
Table. Image detailed information tab items

Tag

On the Image List page, you can view the tag information of the selected resource and add, modify, or delete tags.

Tag List: List of tags
  • The Key and Value information of each tag can be checked
  • Up to 50 tags can be added per resource
  • When entering tags, you can search and select from previously created Key and Value lists
Table. Image tag tab items

Job History

You can view the job history of the selected resource on the Image List page.

Job History List: Resource change history
  • Work date and time, Resource ID, Resource name, Work details, Event topic, Work result, and worker information can be checked
Table. GPU Server Image Job History tab detailed information items

Image Resource Management

This section describes the control and management functions for created images.

Share to another account

To share the Image with another Account, follow the steps below.

  1. Log in to the shared Account and click the All Services > Compute > GPU Server menu. Navigate to the GPU Server’s Service Home page.
  2. Click the Image menu on the Service Home page. Go to the Image list page.
  3. Click the Image to control on the Image List page. It moves to the Image Detail page.
  4. Click the Share to another Account button. It navigates to the Share image to another Account page.
    • Share to another Account feature allows you to share the Image with another Account. To share the Image with another Account, the Image’s Visibility must be Shared.
  5. Share images to another Account On the page, enter the required information and click the Complete button.
    Category | Required or not | Detailed description
    Image Name | - | Name of the image to share
      • Input not allowed
    Image ID | - | ID of the image to share
      • Input not allowed
    Shared Account ID | Required | Another Account ID to share with
      • Enter up to 64 characters using English letters, numbers, and special characters (-)
    Table. Required input items for sharing images to another Account
  6. You can check the sharing status on the Image detail page.
    • At the initial request, the status is Pending, and when approval is completed in the account to be shared, it changes to Accepted.
Notice
Only an Image created by uploading an image file can be shared with another Account. A Custom Image created from the Image of a running GPU Server cannot yet be shared with another Account; this feature is planned for a future release.

Receive sharing from another account

To receive an Image shared from another Account, follow the steps below.

  1. Log in to the Account receiving the share and click the All Services > Compute > GPU Server menu. Navigate to the Service Home page of the GPU Server.
  2. Click the Image menu on the Service Home page. Go to the Image list page.
  2. On the Image list page, click the Receive Image Share button. The Receive Image Share popup window opens.
  3. In the Receive Image Share popup window, enter the resource ID of the Image you want to receive, and click the Confirm button.
  5. When image sharing is completed, you can check the shared Image in the Image list.

Image Delete

You can delete unused Images. However, once an Image is deleted it cannot be recovered, so you should fully consider the impact before proceeding with the deletion.

Caution
Please be careful because data cannot be recovered after deleting the service.

To delete the image, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the GPU Server’s Service Home page.
  2. Click the Image menu on the Service Home page. Go to the Image List page.
  3. On the Image list page, select the resource to delete and click the Delete button.
  • To delete multiple Images at once, select their check boxes on the Image list page and click the Delete button at the top of the resource list.
  4. When deletion is complete, check on the Image list page whether the resource has been deleted.

2 - Using Multi-instance GPU in GPU Server

After creating a GPU Server, you can enable the MIG (Multi-instance GPU) feature on the GPU Server’s VM (Guest OS) and create an instance to use it.

Multi-instance GPU (NVIDIA A100) Overview

The NVIDIA A100, based on the NVIDIA Ampere architecture, supports Multi-instance GPU (MIG): the GPU can be securely partitioned into up to 7 independent GPU instances, each running CUDA (Compute Unified Device Architecture) applications. The A100 provides independent GPU resources to multiple users, allocating compute in a way optimized for GPU usage while utilizing high-bandwidth memory (HBM) and cache. Users can maximize GPU utilization by running, in parallel, workloads that individually do not reach the GPU's maximum compute capacity.

Multi-instance GPU configuration diagram
Figure. Multi-instance GPU configuration diagram

Using Multi-instance GPU Feature

To use the Multi-instance GPU feature, create a GPU Server service on the Samsung Cloud Platform, which provides a VM Instance (Guest OS) with an A100 GPU assigned. After the GPU Server is created, apply MIG by following the MIG application and removal orders below.

Multi-instance GPU creation
Figure. Multi-instance GPU creation

MIG Application Order
MIG activation → GPU Instance creation → Compute Instance creation → MIG usage
MIG Removal Order
Compute Instance deletion → GPU Instance deletion → MIG deactivation
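The removal order maps onto a short sequence of nvidia-smi invocations. The following is a hedged sketch: the commands are destructive, so it is written with a dry-run guard that only prints what would run; set DRY_RUN=0 only on a host where the MIG devices are no longer in use.

```shell
# Sketch of the MIG removal order; with DRY_RUN=1 the commands are echoed
# instead of executed.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}
run nvidia-smi mig -dci      # delete all Compute Instances
run nvidia-smi mig -dgi      # delete all GPU Instances
run nvidia-smi -i 0 -mig 0   # disable MIG mode on GPU 0
```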

Reference
  • The system requirements for using the MIG feature are as follows (refer to NVIDIA - Supported GPUs).
    • CUDA toolkit 11, NVIDIA driver 450.80.02 or later version
    • Linux distribution operating system supporting CUDA toolkit 11
  • When operating a container or Kubernetes service, the requirements for using the MIG feature are as follows.
    • NVIDIA Container Toolkit (nvidia-docker2) v2.5.0 or later
    • NVIDIA K8s Device Plugin v0.7.0 or later
    • NVIDIA gpu-feature-discovery v0.2.0 or later
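The driver requirement above can be verified with a simple version comparison. A small sketch; the canned `ver` value stands in for the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader` on a real host:

```shell
# MIG needs NVIDIA driver 450.80.02 or later; sort -V compares versions.
# "ver" is a canned sample here; read it from nvidia-smi on a real host.
ver="470.57.02"
min="450.80.02"
if [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n1)" = "$min" ]; then
  echo "driver $ver meets the MIG minimum ($min)"
else
  echo "driver $ver is below the MIG minimum ($min)"
fi
```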

MIG Application and Usage

To activate MIG and create an instance to assign a task, follow these steps.

MIG Application Order
MIG activation → GPU Instance creation → Compute Instance creation → MIG usage

MIG Activation

  1. Check the GPU status on the VM Instance (GuestOS) before applying MIG.

    • Check that MIG mode is in the Disabled state.
      $ nvidia-smi
      Mon Sep 27 08:37:08 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
      | N/A   32C   P0    59W / 400W  |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU inactive state (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU inactive state (2)
  2. In the VM Instance(GuestOS), enable MIG for each GPU and reboot the VM Instance.

    $ nvidia-smi -i 0 -mig 1
    Enabled MIG mode for GPU 00000000:05:00.0
    All done.
    
    # reboot
    Code Block. nvidia-smi Command - MIG Activation

Note

If the GPU monitoring agent displays the following warning message, stop the nvsm and dcgm services before enabling MIG.

Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).

# systemctl stop nvsm
# systemctl stop dcgm
  • After completing the MIG work, restart the nvsm and dcgm services.
  3. Check the GPU status after applying MIG in the VM Instance (GuestOS).
    • Check that MIG mode is in the Enabled state.
      $ nvidia-smi
      Mon Sep 27 09:44:33 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C   P0    59W / 400W  |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
      |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                   |                     |        ECC|                       |
      |=============================================================================|
      | No MIG devices found                                                        |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU activation status (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU activation status (2)
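The Enabled check can also be scripted instead of reading the full table. A hedged sketch using nvidia-smi's `mig.mode.current` query property; the canned `sample` line stands in for live output on a real host:

```shell
# Query-style MIG mode check; on a real host replace "sample" with:
#   nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader
sample='0, Enabled'
mig_mode=$(printf '%s\n' "$sample" | awk -F', ' '$1 == 0 {print $2}')
echo "MIG mode on GPU 0: $mig_mode"
```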

GPU Instance Creation

After activating MIG and checking the status, you can create a GPU Instance.

  1. Check the list of MIG GPU instance profiles that can be created.

    $ nvidia-smi mig -i [GPU ID] -lgip
    Code block. nvidia-smi command - MIG GPU Instance profile list check

    $ nvidia-smi mig -i 0 -lgip
    +-----------------------------------------------------------------------------+
    | GPU instance profiles:                                                      |
    | GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
    |                              Free/Total   GiB              CE    JPEG  OFA  |
    |=============================================================================|
    |   0 MIG 1g.10gb        19    7/7         9.50       No     14     0     0   |
    |                                                             1     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 1g.10gb+me     20    1/1         9.50       No     14     0     0   |
    |                                                             1     1     1   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 2g.20gb        14    3/3         19.50      No     28     1     0   |
    |                                                             2     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 3g.40gb         9    2/2         39.50      No     42     2     0   |
    |                                                             3     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 4g.40gb         5    1/1         39.50      No     56     2     0   |
    |                                                             4     0     0   |
    +-----------------------------------------------------------------------------+
    |   0 MIG 7g.80gb         0    1/1         79.25      No     98     0     0   |
    |                                                             7     1     1   |
    +-----------------------------------------------------------------------------+
    Code Block. MIG GPU Instance Profile List
Note
The A100 GPU Instance profiles above follow the NVIDIA A100 MIG profiles shown below.
MIG Device Naming
Figure. MIG Device Naming
Profile Name | Fraction of Memory | Fraction of SMs | Hardware Units | L2 Cache Size | Number of Instances Available
MIG 1g.10gb | 1/8 | 1/7 | 0 NVDECs / 0 JPEG / 0 OFA | 1/8 | 7
MIG 1g.10gb+me | 1/8 | 1/7 | 1 NVDEC / 1 JPEG / 1 OFA | 1/8 | 1 (a single 1g profile can include media extensions)
MIG 2g.20gb | 2/8 | 2/7 | 1 NVDEC / 0 JPEG / 0 OFA | 2/8 | 3
MIG 3g.40gb | 4/8 | 3/7 | 2 NVDECs / 0 JPEG / 0 OFA | 4/8 | 2
MIG 4g.40gb | 4/8 | 4/7 | 2 NVDECs / 0 JPEG / 0 OFA | 4/8 | 1
MIG 7g.80gb | Full | 7/7 | 5 NVDECs / 1 JPEG / 1 OFA | Full | 1
Table. NVIDIA A100 MIG Profile
Note
The MIG 1g.10gb+me profile is available starting with the R470 driver.
  2. Create the MIG GPU Instance and check it.
    • GPU Instance creation

      $ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]
      Code Block. nvidia-smi command - GPU Instance creation
      $ nvidia-smi mig -i 0 -cgi 0
      Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)
      Code block. nvidia-smi command - GPU Instance creation example

    • GPU Instance check

      $ nvidia-smi mig -i [GPU ID] -lgi
      Code Block. nvidia-smi Command - GPU Instance Check
      $ nvidia-smi mig -i 0 -lgi
      +--------------------------------------------------------+
      | GPU instances:                                         |
      | GPU   Name               Profile  Instance  Placement  |
      |                            ID       ID      Start:Size |
      |========================================================|
      |   0  MIG 7g.80gb            0        0         0:8     |
      +--------------------------------------------------------+
      Code block. nvidia-smi command - GPU Instance check example
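As a first-order capacity check against the profile table above, each GPU Instance consumes its g-value in compute slices and an A100 exposes 7 in total. This sketch sums the g-values of a candidate layout; real MIG placement rules are stricter, so treat it only as a rough screen:

```shell
# Rough MIG capacity screen: the sum of g-values must not exceed 7 on A100.
slices=0
for profile in 3g.40gb 2g.20gb 1g.10gb 1g.10gb; do
  g=${profile%%g*}          # leading digit(s) before the first "g"
  slices=$((slices + g))
done
echo "compute slices used: $slices / 7"
```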

Compute Instance Creation

If you have created a GPU Instance, you can create a Compute Instance.

  1. Check the MIG Compute Instance profile that can be created.

    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
    Code Block. nvidia-smi command - MIG Compute Instance profile check
    $ nvidia-smi mig -i 0 -gi 0 -lcip
    +---------------------------------------------------------------------------------+
    | Compute instance profiles:                                                      |
    | GPU     GPU     Name            Profile  Instances   Exclusive      Shared      |
    | GPU   Instance                     ID    Free/Total     SM       DEC  ENC  OFA  |
    |         ID                                                       CE   JPEG      |
    |=================================================================================|
    |   0      0      MIG 1c.7g.80gb     0      7/7           14       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 2c.7g.80gb     1      3/3           28       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 3c.7g.80gb     2      2/2           42       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 4c.7g.80gb     3      1/1           56       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 7g.80gb        4*     1/1           98       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    Code block. MIG Compute Instance profile list example

  2. Create and check the MIG Compute Instance.

    • MIG Compute Instance creation
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]
      Code Block. nvidia-smi command - MIG Compute Instance creation
      $ nvidia-smi mig -i 0 -gi 0 -cci 4
      Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb (ID 4)
      Code block. nvidia-smi command - MIG Compute Instance creation example
    • MIG Compute Instance check
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci
      Code block. nvidia-smi command - MIG Compute Instance check
      $ nvidia-smi mig -i 0 -gi 0 -lci
      +-----------------------------------------------------------------+
      | Compute instance profiles:                                      |
      | GPU     GPU     Name            Profile  Instances   Placement  |
      | GPU   Instance                     ID      ID        Start:Size |
      |         ID                                                      |
      |=================================================================|
      |   0      0      MIG 7g.80gb         4       0            0:7    |
      +-----------------------------------------------------------------+
      Code block. MIG Compute Instance confirmation example
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
        MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
      Code block. nvidia-smi command - Check GPU status (1)
      $ nvidia-smi
      Mon Sep 27 09:52:17 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------|
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C   P0    49W / 400W  |      0MiB / 81251MiB |     N/A      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
      |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                   |                     |        ECC|                       |
      |=============================================================================|
      |   0    0   0    0 |     0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                   |     1MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      | No running processes found                                                  |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU status (2)

Using MIG

  1. Use the MIG Instance to perform the Job.
    • Work execution example
      $ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi
      Code Block. Work Execution Example
      You can check an example of how to perform the task as follows.
      $ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh
      
      ================
      == TensorFlow ==
      ================
      
      NVIDIA Release 21.08-tf1 (build 26012104)
      TensorFlow Version 1.15.5
      
      Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
      ...
      
      # Python process execution
      root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100 
      ...
      PY 3.8.10 (default, Jun 2 2021, 10:49:15)
      [GCC 9.4.0]
      TF 1.15.5
      ...
      Code Block. Work Result
  2. Check the GPU utilization. (after the Job process is created)
    • When the Job runs, you can see that the process is assigned to the MIG device and utilization increases.
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
      Code Block. nvidia-smi command - Check GPU usage
      You can check the GPU usage rate as follows.
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
      |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                   |                     |        ECC|                       |
      |=============================================================================|
      |   0    0   0    0 | 66562MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                   |     5MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |   0     0    0     17483      C   python                           66559MiB |
      +-----------------------------------------------------------------------------+
      Code block. Example of checking GPU usage
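The `--gpus '"device=[GPU ID]:[MIG ID]"'` addressing used in step 1 can be wrapped in a small dry-run helper. This is only a sketch: it prints the command instead of running it, and `IMAGE` is a placeholder, not a real container image name.

```shell
# Dry-run helper for MIG device addressing; prints the docker command
# instead of executing it. IMAGE is a placeholder, not a real image name.
mig_docker_cmd() {
  gpu_id=$1; mig_id=$2; image=$3
  printf 'docker run --gpus '\''"device=%s:%s"'\'' --rm %s nvidia-smi\n' \
    "$gpu_id" "$mig_id" "$image"
}
cmd=$(mig_docker_cmd 0 0 IMAGE)
echo "$cmd"
```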

MIG Instance deletion and release

To delete a MIG instance and release the MIG, follow these procedures.

MIG Removal Order
Compute Instance deletion → GPU Instance deletion → MIG feature disablement (deactivation)
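A dry-run sketch of this removal order (the GPU/GI/CI IDs are hypothetical examples; the script only prints the commands, in the required sequence):

```shell
# MIG teardown order: Compute Instance -> GPU Instance -> MIG mode off.
# Prints the commands rather than executing them; gpu/gi/ci IDs are examples.
gpu=0; gi=0; ci=0
teardown=$(cat <<EOF
nvidia-smi mig -i $gpu -gi $gi -ci $ci -dci
nvidia-smi mig -i $gpu -dgi
nvidia-smi -mig 0
EOF
)
echo "$teardown"
```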

Compute Instance deletion

  • Delete the Compute Instance.
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance] -dci
    Code Block. nvidia-smi command - Compute Instance deletion
    $ nvidia-smi mig -i 0 -gi 0 -lci
    +-----------------------------------------------------------------+
    | Compute instance profiles:                                      |
    | GPU     GPU     Name            Profile  Instances   Placement  |
    | GPU   Instance                     ID      ID        Start:Size |
    |         ID                                                      |
    |=================================================================|
    |   0      0      MIG 7g.80gb         4       0            0:7    |
    +-----------------------------------------------------------------+
    Code Block. MIG Compute Instance Check Example
    $ nvidia-smi mig -i 0 -gi 0 -dci
    Successfully destroyed compute instance ID  0 from GPU instance ID  0
    Code Block. Compute Instance deletion example
    $ nvidia-smi mig -i 0 -gi 0 -lci
    No compute instances found: Not found
    Code Block. Compute Instance deletion confirmation

GPU Instance deletion

  • Delete the GPU Instance.
    $ nvidia-smi mig -i [GPU ID] -dgi
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi
    Code block. nvidia-smi command - GPU Instance deletion
    $ nvidia-smi mig -i 0 -lgi
    +--------------------------------------------------------+
    | GPU instances:                                         |
    | GPU   Name               Profile  Instance  Placement  |
    |                            ID       ID      Start:Size |
    |========================================================|
    |   0  MIG 7g.80gb            0        0         0:8     |
    +--------------------------------------------------------+
    Code block. nvidia-smi command - GPU Instance check example
    $ nvidia-smi mig -i 0 -dgi
    Successfully destroyed GPU instance ID  0 from GPU  0
    Code block. nvidia-smi command - GPU Instance deletion example
    $ nvidia-smi mig -i 0 -lgi
    No GPU instances found: Not found
    Code block. nvidia-smi command - GPU Instance deletion confirmation

MIG Function Disablement (Deactivation)

  • Disable MIG and then reboot.
    $ nvidia-smi -mig 0
    Disabled MIG Mode for GPU 00000000:05:00.0
    
    All done.
    Code Block. nvidia-smi command - MIG disable
    $ nvidia-smi
    Mon Sep 30 05:18:28 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
    |-------------------------------+----------------------+----------------------|
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM... Off   | 00000000:05:00.0 Off |                    0 |
    | N/A   33C   P0    60W / 400W  |      0MiB / 81251MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | MIG devices:                                                                |
    +-----------------------------------------------------------------------------+
    |  GPU  GI  CI  MIG |        Memory-Usage |        Vol|        Shared         |
    |       ID  ID  Dev |          BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
    |                   |                     |        ECC|                       |
    |=============================================================================|
    | No MIG devices found                                                        |
    +-----------------------------------------------------------------------------+
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI       PID   Type   Process name                   GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    | No running processes found                                                  |
    +-----------------------------------------------------------------------------+
    Code Block. nvidia-smi command - Check GPU status

3 - Using NVSwitch on GPU Server

After creating a GPU Server, you can enable the NVSwitch feature in the GPU Server’s VM (Guest OS) and use fast P2P (GPU-to-GPU) communication between GPUs.

Exploring NVIDIA NVSwitch for Multi GPU

The NVIDIA A100 GPU server is a multi-GPU system based on the NVIDIA Ampere architecture, with 8 Ampere 80 GB GPUs installed on the baseboard. The GPUs on the baseboard are connected to 6 NVSwitches via NVLink ports, so communication between GPUs on the baseboard uses the full 600 GBps bandwidth. As a result, the 8 GPUs installed in the A100 GPU server can be connected and operated as if they were a single GPU, maximizing GPU-to-GPU throughput.

  • NVLink(25 GBps) 12 Lane 8 GPU configuration
NVLink(25 GBps) 12 lanes 8 GPU configuration diagram
Figure. NVLink(25 GBps) 12 lanes 8 GPU configuration diagram
  • NVSwitch(600 GBps) 6 units 8 GPU configuration diagram
NVSwitch(600 GBps) 6 units 8 GPU configuration diagram
Figure. NVSwitch(600 GBps) 6 units 8 GPU configuration diagram
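The 600 GBps figure follows directly from the per-lane numbers above, assuming 25 GBps per NVLink lane per direction:

```shell
# Aggregate NVLink bandwidth per A100 GPU:
# 12 lanes x 25 GBps per direction x 2 directions = 600 GBps.
per_lane_gbps=25
lanes=12
directions=2
total_gbps=$(( per_lane_gbps * lanes * directions ))
echo "${total_gbps} GBps"
```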

Create GPU NVSwitch

To use the GPU NVSwitch feature, create a GPU Server service on the Samsung Cloud Platform, create a VM Instance (GuestOS) with 8 A100 GPUs assigned, and activate the Fabricmanager.

Caution
  • NVSwitch can only be activated and used for products with 8 A100 GPUs assigned to a single GPU server (g1v128a8 (vCPU 128 | Memory 1920G | A100(80GB)*8)).
  • Currently, GPU Server created with Windows OS does not support NVSwitch (Fabricmanager).

NVSwitch Installation and Operation Check (Fabric Manager Activation)

To operate NVSwitch, install Fabricmanager on the GPU server according to the following procedure.

  1. Install the NVIDIA GPU Driver (version 470.57.02) on the GPU server.

    $ add-apt-repository ppa:graphics-drivers/ppa
    $ apt-get update
    $ apt-get install nvidia-driver-470-server
    Code Block. NVIDIA GPU Driver Installation

  2. Install and run NVIDIA Fabric Manager (470 Version) on the GPU server (For NVSwitch).

    $ apt-get install cuda-drivers-fabricmanager-470
    $ systemctl enable nvidia-fabricmanager
    $ systemctl start nvidia-fabricmanager
    Code Block. NVIDIA Fabric Manager Installation and Operation

  3. Check the status of NVIDIA Fabric Manager running on the GPU server.

    • Normal operation indicates active (running)
      $ systemctl status nvidia-fabricmanager
      Code Block. Check NVIDIA Fabric Manager Operation Status
NVSwitch installation - Checking the operation status of Fabric Manager
Figure. NVSwitch installation - Checking the operation status of Fabric Manager
  4. Check the NVSwitch operation status on the GPU server.
    • When operating normally, the topology matrix shows NV12 between GPUs
      $ nvidia-smi topo --matrix
      Code block. NVSwitch operation status check
NVSwitch Installation - Checking NVSwitch Operation Status
Figure. NVSwitch Installation - Checking NVSwitch Operation Status
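As a sanity check on the version pairing above (the Fabric Manager package branch must match the driver branch), here is a sketch using sample version strings; on a real server, take the driver version from nvidia-smi output instead.

```shell
# Fabric Manager must come from the same release branch as the driver
# (both 470 here). Version strings are samples, not live output.
driver_version="470.57.02"
fm_package="cuda-drivers-fabricmanager-470"
driver_branch=${driver_version%%.*}   # everything before the first dot
fm_branch=${fm_package##*-}           # trailing branch number of the package
[ "$driver_branch" = "$fm_branch" ] && echo "branch match: $driver_branch"
```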

4 - Keypair Management

The user can enter the required information for the Keypair within the GPU Server service through the Samsung Cloud Platform Console, select detailed options, and create the service.

Keypair Create

You can create and use the Keypair service while using the GPU Server service on the Samsung Cloud Platform Console.

To create a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the Service Home page of GPU Server.
  2. Click the Keypair menu on the Service Home page. Navigate to the Keypair List page.
  3. Click the Create Keypair button on the Keypair List page. Navigate to the Create Keypair page.
    • Service Information Input Enter the required information in the area.
      Category
      Required or not
      Detailed description
      Keypair nameRequiredName of the Keypair to create
      • Enter within 255 characters using English letters, numbers, spaces, and special characters (-, _)
      Keypair typeRequiredssh
      Table. Keypair Service Information Input Items
    • Additional Information Input Enter or select the required information in the area.
      Category
      Required or not
      Detailed description
      TagSelectAdd Tag
      • Up to 50 can be added per resource
      • After clicking the Add Tag button, enter or select Key, Value values
      Table. Keypair additional information input items
      Caution
      • After creation is complete, you can download the Key only once for the first time. Since reissuance is not possible, make sure it has been downloaded.
      • Save the downloaded Private Key in a safe place.
  4. Check the input information and click the Complete button.
    • When creation is complete, check the created resource on the Keypair List page.

Keypair Check detailed information

In the Keypair service, you can view and edit the full resource list and detailed information. The Keypair Details page consists of the Details, Tags, and Activity History tabs.

To view detailed information about the Keypair, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the GPU Server’s Service Home page.
  2. On the Service Home page, click the Keypair menu. Navigate to the Keypair list page.
  3. On the Keypair list page, click the resource whose details you want to view. Navigate to the Keypair Details page.
    • Keypair Details page displays status information and additional feature information, and consists of Details, Tags, Activity History tabs.

Detailed Information

On the Keypair List page, you can view detailed information of the selected resource and, if necessary, edit the information.

CategoryDetailed description
ServiceService Name
Resource TypeResource Type
SRNUnique resource ID in Samsung Cloud Platform
  • In Keypair, it means Keypair SRN
Resource NameKeypair Name
Resource IDKeypair’s unique resource ID
CreatorUser who created the Keypair
Creation timeTime when the keypair was created
EditorUser who modified the Keypair information
Modification Date/TimeTimestamp of Keypair information modification
Keypair nameKeypair name
FingerprintUnique value for identifying a Key
User IDUser ID of the user who created the Keypair
Public KeyPublic Key Information
Table. Keypair detailed information tab items

Tag

Keypair list page allows you to view the tag information of selected resources, and you can add, modify, or delete them.

CategoryDetailed description
Tag ListTag List
  • Tag’s Key, Value information can be checked
  • Up to 50 tags can be added per resource
  • When entering a tag, search and select from the list of previously created Keys and Values
Table. Keypair Tag Tab Items

Work History

On the Keypair list page, you can view the operation history of the selected resource.

CategoryDetailed description
Work History ListResource Change History
  • Task date/time, Resource ID, Resource name, Task details, Event topic, Task result, Operator information verification
Table. Keypair operation history tab detailed information items

Keypair Resource Management

Describes the control and management functions of the keypair.

Get public key

To retrieve the public key, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.

  2. On the Service Home page, click the Keypair menu. Navigate to the Keypair list page.

  3. In the Keypair list page, click the More button at the top, then click the Import Public Key button. You will be taken to the Import Public Key page.

    • Required Information Input Enter or select the required information in the area.
      Category
      Required
      Detailed description
      Keypair nameRequiredName of the Keypair to create
      Keypair typeRequiredssh
      Public KeyRequiredEnter Public Key
      • Load File: Attach File button to select and attach the public key file
        • Only files with the following extension (.pem) can be attached
      • Enter Public Key: Paste the copied public key value
        • Public key value can be copied from the Keypair Details page
      Table. Required input items for fetching public key
  4. Check the entered information and click the Complete button.

    • Once creation is complete, check the created resource on the Keypair List page.
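If you need a public key to import, one can be generated locally with the OpenSSH client. This sketch assumes ssh-keygen is installed; the file paths are hypothetical.

```shell
# Generate a keypair with no passphrase; the .pub file holds the public key
# value to attach or paste, and the second ssh-keygen call prints the
# fingerprint shown on the Keypair Details page.
ssh-keygen -t rsa -b 2048 -f ./scp_demo_key -N "" -q
cat ./scp_demo_key.pub
ssh-keygen -lf ./scp_demo_key.pub
```

Since only files with the .pem extension can be attached, rename the .pub file accordingly or paste its contents into the Enter Public Key field.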

Delete Keypair

You can delete unused Keypairs. However, once a Keypair is deleted it cannot be recovered, so please review the impact thoroughly in advance before proceeding with deletion.

Caution
Please be careful as data cannot be recovered after deleting the service.

To delete a keypair, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the GPU Server’s Service Home page.
  2. Click the Keypair menu on the Service Home page. Navigate to the Keypair List page.
  3. On the Keypair list page, select the resource to delete and click the Delete button.
    • To delete multiple Keypairs at once, select their check boxes on the Keypair list page and click the Delete button at the top of the resource list.
  4. After deletion is complete, check on the Keypair List page whether the resource has been deleted.

5 - ServiceWatch Agent Install

Users can install the ServiceWatch Agent on the GPU Server to collect custom metrics and logs.

Reference
Custom metric/log collection via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise; it will be offered in other offerings in the future.
Caution
Metric collection via the ServiceWatch Agent is classified as custom metrics and, unlike the default metrics collected from each service, incurs charges. It is therefore recommended to remove or disable unnecessary metric collection settings.

ServiceWatch Agent

The agents that need to be installed on the GPU Server to collect ServiceWatch custom metrics and logs fall into two main types: the Prometheus Exporter and the OpenTelemetry Collector.

CategoryDetailed description
Prometheus ExporterProvides metrics of a specific application or service in a format that Prometheus can scrape
  • For collecting server OS metrics, you can use Node Exporter for Linux servers and Windows Exporter for Windows servers, depending on the OS type.
OpenTelemetry CollectorCollects telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and acts as a centralized collector that exports to various backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Description of Prometheus Exporter and OpenTelemetry Collector
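As an illustration of how the two components fit together, a minimal OpenTelemetry Collector pipeline that scrapes a Node Exporter and forwards metrics might look like the sketch below. The endpoint and port values are placeholders, not actual ServiceWatch Gateway settings.

```yaml
# Minimal sketch: scrape a Prometheus exporter, export over OTLP.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node
          static_configs:
            - targets: ["localhost:9100"]   # Node Exporter default port
exporters:
  otlphttp:
    endpoint: https://servicewatch-gateway.example:4318   # placeholder
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]
```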
Caution

If you have configured Kubernetes Engine on a GPU Server, please check GPU metrics through the metrics provided by Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not work properly.

Install Prometheus Exporter for GPU metrics (for Ubuntu)

Install the Prometheus Exporter to collect metrics of the GPU Server according to the steps below.

NVIDIA Driver Installation Check

  • Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu=driver_version --format=csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example

NVSwitch Configuration and Query (NSCQ) Library Installation

Reference
The NVSwitch Configuration and Query (NSCQ) Library is required for GPUs of the Hopper generation and earlier.
Notice
The installation commands below work in an environment with internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    Code block. cuda-keyring download command
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    apt update
    Code block. cuda-keyring installation command
    nvidia-smi --query-gpu=driver_version --format=csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    
    535.183.06
    Code block. NVIDIA Driver version check example

  2. Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535
    Code block. NSCQ library apt-cache command
    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install version matching the Driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NSCQ library apt-cache command result
    apt install libnvidia-nscq-535=535.183.06-1
    Code block. NSCQ library installation command

Notice

The libnvidia-nscq package must be installed with the same version as the NVIDIA Driver.

  • Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
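The matching rule can be scripted by deriving the package pin from the driver version. The version string below is hard-coded from the example above; on a real server you would substitute the output of `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```shell
# Build the libnvidia-nscq package pin from a driver version string.
driver="535.183.06"    # sample value; take from nvidia-smi on a real server
branch=${driver%%.*}   # release branch, e.g. 535
pin="libnvidia-nscq-${branch}=${driver}-1"
echo "apt install $pin"
```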

NVSwitch Device Monitoring API(NVSDM) Library Installation

Note
For the Blackwell GPU architecture and later, installation of the NVSDM Library is required. NVIDIA Driver versions 560 and below do not provide the NVSDM Library.
  • NVSDM library install.
    apt-cache policy libnvsdm
    Code block. NVSDM library apt-cache command
    Color mode
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library apt-cache command result
    apt install libnvsdm=580.105.08-1
    Code block. NVSDM library installation

NVIDIA DCGM Installation (for Ubuntu)

Install the DCGM Exporter according to the following steps.

  1. DCGM(datacenter-gpu-manager) Installation
  2. datacenter-gpu-manager-exporter Installation
  3. DCGM Service Activation and Start

DCGM(datacenter-gpu-manager) Installation

datacenter-gpu-manager-4-cuda12 refers to a specific package of NVIDIA's Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs: cuda12 indicates a build for CUDA version 12, and datacenter-gpu-manager-4 denotes the 4.x version of DCGM. The tool provides GPU status monitoring, diagnostics, an alert system, and power/clock management.

  1. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  2. Install datacenter-gpu-manager-cuda.
    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation command
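Instead of reading the CUDA version off the nvidia-smi header by hand, the major version can be extracted with sed. A minimal sketch, using the example header line above as canned input (on a live server, capture it with `nvidia-smi | grep CUDA` instead):

```shell
# Canned nvidia-smi header line, taken from the example output above.
smi_line='| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |'

# Keep only the major version (12.2 -> 12), which is what the package name uses.
CUDA_VERSION=$(printf '%s\n' "$smi_line" | sed -n 's/.*CUDA Version: \([0-9]*\)\..*/\1/p')
echo "$CUDA_VERSION"    # 12
# Then: apt install "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```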

datacenter-gpu-manager-exporter installation

datacenter-gpu-manager-exporter is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as utilization, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.

  1. Install datacenter-gpu-manager-exporter.
    apt install datacenter-gpu-manager-exporter
    Code block. datacenter-gpu-manager-exporter installation command
  2. Check the DCGM Exporter configuration file.
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. datacenter-gpu-manager-exporter configuration file verification command
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. datacenter-gpu-manager-exporter configuration file check result example
  3. After installing DCGM Exporter, review the provided settings: remove the leading # from metrics you need, and add # in front of metrics you do not need.
    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric setting example
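The vi edits above can also be done non-interactively. A minimal sketch using sed on a throwaway copy of the file (on a real host, point it at /etc/dcgm-exporter/default-counters.csv; the metric choices here are only examples):

```shell
# Work on a temporary copy so the sketch is safe to run anywhere.
csv=$(mktemp)
cat > "$csv" <<'EOF'
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
EOF

# Enable the fp64 metric (strip its leading '# ') and disable the tensor metric (add one).
sed -i -e 's/^# \(DCGM_FI_PROF_PIPE_FP64_ACTIVE\)/\1/' \
       -e 's/^\(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE\)/# \1/' "$csv"
cat "$csv"
```

Restart the exporter after changing the file so the new counter set takes effect.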
Reference
For the metrics that can be collected with GPU DCGM Exporter and how to configure them, see DCGM Exporter metrics.
Caution
Since metric collection via ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default collected metrics, unnecessary metric collection must be removed or disabled to avoid excessive charges.

DCGM Service activation and start

  1. Activate and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Activate and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command
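Once both services are running, it is worth confirming that metrics are actually being served. A minimal sketch, assuming the exporter's default listen port 9400 (the `check_exporter` helper and its offline override are illustrative):

```shell
# Sketch: verify dcgm-exporter is serving DCGM_* metrics over HTTP.
check_exporter() {
  # $1 optionally overrides the fetch command so the check can be tried offline.
  ${1:-curl -sf http://localhost:9400/metrics} | grep -q '^DCGM_' \
    && echo "exporter OK" \
    || { echo "no DCGM metrics served" >&2; return 1; }
}

# On the GPU server, simply run:  check_exporter
# Offline demonstration with a canned metric line:
check_exporter 'echo DCGM_FI_DEV_GPU_UTIL'
```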

Notice
If you have completed the DCGM Exporter setup, install the OpenTelemetry Collector provided by ServiceWatch to complete the ServiceWatch Agent configuration.
For more details, see ServiceWatch > Using ServiceWatch Agent.

Installation of Prometheus Exporter for GPU metrics (for RHEL)

Install the ServiceWatch Agent according to the steps below to collect GPU Server metrics.

NVIDIA Driver Installation Check (for RHEL)

  1. Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu=driver_version --format=csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example
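The csv output above can be parsed to count GPUs and confirm that every GPU reports the same driver version. A minimal sketch using the example output as canned input (on a live server, capture it with `out=$(nvidia-smi --query-gpu=driver_version --format=csv)` instead):

```shell
# Canned csv output, abbreviated from the example above (one version line per GPU).
out='driver_version
535.183.06
535.183.06'

# Skip the csv header, then count lines (= GPUs) and distinct versions.
gpus=$(printf '%s\n' "$out" | tail -n +2 | grep -c '')
distinct=$(printf '%s\n' "$out" | tail -n +2 | sort -u | grep -c '')
echo "gpus=$gpus distinct_driver_versions=$distinct"
```

A distinct-version count greater than 1 indicates a mixed driver installation that should be fixed before installing the NSCQ/NVSDM libraries.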

NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)

Reference

The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper and earlier GPU generations.

  • For RHEL, check whether libnvidia-nscq is installed, and then install it.
Notice
The installation commands below can be used in an environment with internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Check the libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq
    libnvidia-nscq-535-535.183.06-1.x86_64
    Code block. NSCQ library package check

  2. Add CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  3. Initialize the NVIDIA Driver module state.

    dnf module reset nvidia-driver
    Code block. NVIDIA Driver DNF module state initialization
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Disabling module profiles:
    nvidia-driver/default
    nvidia-driver/fm
    Resetting modules:
    nvidia-driver
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example of the result of state initialization of the NVIDIA Driver DNF module

  4. Activate the NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open
    Code block. NVIDIA Driver module activation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    
    Package Architecture Version Repository Size
    =============================================
    Enabling module streams:
    nvidia-driver 535-open
      
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. NVIDIA Driver module activation result example

  5. Check the libnvidia-nscq module list.

    dnf list libnvidia-nscq-535 --showduplicates
    Code block. libnvidia-nscq module list check

  6. Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
    Code block. libnvidia-nscq installation command

NVSwitch Device Monitoring API (NVSDM) Library Installation (for RHEL)

Reference
For Blackwell and later GPU architectures, the NVSDM Library must be installed. NVIDIA Driver versions 560 and earlier do not provide the NVSDM Library.
  1. Check the NVSDM library module list.

    dnf list libnvsdm --showduplicates
    Code block. Check NVSDM library module list
    Available Packages
    libnvsdm.x86_64   580.65.06-1    cuda-rhel8-x86_64
    libnvsdm.x86_64   580.82.07-1    cuda-rhel8-x86_64
    libnvsdm.x86_64   580.95.05-1    cuda-rhel8-x86_64
    libnvsdm.x86_64   580.105.08-1   cuda-rhel8-x86_64
    Code block. NVSDM library module list verification result example

  2. Install libnvsdm.

    dnf install libnvsdm-580.105.08-1
    Code block. NVSDM library installation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    =========================================================================
    
    Package Architecture Version Repository Size
    =========================================================================
    
    Installing:
    libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
    Installing dependencies:
    infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
    libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
    
    Transaction Summary
    =========================================================================
    
    Install 3 Packages
    
    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
    Code block. NVSDM library installation command result example

NVIDIA DCGM Installation (for RHEL)

Install the DCGM Exporter according to the following steps.

  1. DCGM(datacenter-gpu-manager) Installation
  2. datacenter-gpu-manager-exporter installation
  3. DCGM Service Activation and Start

DCGM(datacenter-gpu-manager) Installation (for RHEL)

datacenter-gpu-manager-4-cuda12 refers to a specific package of NVIDIA's Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs: cuda12 indicates a build for CUDA version 12, and datacenter-gpu-manager-4 denotes the 4.x version of DCGM. The tool provides GPU status monitoring, diagnostics, an alert system, and power/clock management.

  1. Add CUDA Repository to DNF.
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository
  2. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  3. Check the list of datacenter-gpu-manager-cuda modules.
    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
    Code block. Check datacenter-gpu-manager-cuda module list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
    Code block. datacenter-gpu-manager-cuda module list check result example
  4. Install datacenter-gpu-manager-cuda.
    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
    Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M
    
    Transaction Summary
    ====================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
    Code block. datacenter-gpu-manager-cuda installation result example

datacenter-gpu-manager-exporter installation (for RHEL)

datacenter-gpu-manager-exporter is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as utilization, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.

  1. Add the CUDA Repository to DNF. (If you have already performed this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  2. Check the CUDA version. (If you have already performed this command, proceed to the next step.)

    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command

  3. Check the list of datacenter-gpu-manager-exporter modules.

    dnf list datacenter-gpu-manager-exporter --showduplicates
    Code block. datacenter-gpu-manager-exporter module list check
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64
    Code block. datacenter-gpu-manager-exporter module list check result example

  4. Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.x requires glibc 2.34 or later, which is provided only from RHEL 9, so on RHEL 8 specify version 4.1.3-1 for installation.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1
    Code block. datacenter-gpu-manager-exporter installation
    Updating Subscription Management repositories.
    Unable to read consumer identity
      
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
    Package                                       Architecture   Version     Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter               x86_64         4.1.3-1     cuda-rhel8-x86_64   26 M
    
    ...
    Is this ok [y/N]: y
    Code block. datacenter-gpu-manager-exporter installation result example
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. datacenter-gpu-manager-exporter configuration file check result example

  5. After installing DCGM Exporter, review the provided settings: remove the leading # from metrics you need, and add # in front of metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric configuration example
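The glibc constraint from step 4 can be expressed as a small helper that picks which package to install. A minimal sketch (the `pick_exporter` name is illustrative; the sample glibc versions 2.28 and 2.34 correspond to RHEL 8 and RHEL 9 respectively, an assumption not stated in this guide):

```shell
# Sketch: choose the dcgm-exporter package based on the glibc version,
# since 4.5.x requires glibc >= 2.34, which RHEL 8 does not provide.
pick_exporter() {
  major=${1%%.*}; minor=${1#*.}
  if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 34 ]; }; then
    echo "datacenter-gpu-manager-exporter"          # latest release is fine
  else
    echo "datacenter-gpu-manager-exporter-4.1.3-1"  # last release for older glibc
  fi
}

# On a live host: pick_exporter "$(ldd --version | sed -n '1s/.* \([0-9.]*\)$/\1/p')"
pick_exporter 2.28
pick_exporter 2.34
```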

Reference
For the metrics that can be collected with GPU DCGM Exporter and how to configure them, see DCGM Exporter Metrics.
Caution
Since metric collection via ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default collected metrics, unnecessary metric collection must be removed or disabled to avoid excessive charges.

DCGM Service Activation and Start (for RHEL)

  1. Activate and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Activate and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command

Notice
If you have completed the DCGM Exporter setup, install the OpenTelemetry Collector provided by ServiceWatch to complete the ServiceWatch Agent configuration.
For more details, see ServiceWatch > Using ServiceWatch Agent.

DCGM Exporter Metrics

DCGM Exporter Key Metrics

Among the metrics provided by DCGM Exporter, the main GPU metrics are as follows.

Category     | DCGM Field                                      | Prometheus Metric Type | Summary
Clocks       | DCGM_FI_DEV_SM_CLOCK                            | gauge                  | SM clock frequency (in MHz)
Clocks       | DCGM_FI_DEV_MEM_CLOCK                           | gauge                  | Memory clock frequency (in MHz)
Temperature  | DCGM_FI_DEV_GPU_TEMP                            | gauge                  | GPU temperature (in C)
Power        | DCGM_FI_DEV_POWER_USAGE                         | gauge                  | Power draw (in W)
Utilization  | DCGM_FI_DEV_GPU_UTIL                            | gauge                  | GPU utilization (in %)
Utilization  | DCGM_FI_DEV_MEM_COPY_UTIL                       | gauge                  | Memory utilization (in %)
Memory Usage | DCGM_FI_DEV_FB_FREE                             | gauge                  | Frame buffer memory free (in MiB)
Memory Usage | DCGM_FI_DEV_FB_USED                             | gauge                  | Frame buffer memory used (in MiB)
Nvlink       | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter                | Total number of NVLink bandwidth counters for all lanes
Table. Major GPU metrics provided by DCGM Exporter

DCGM Exporter Metric Collection Settings

For the default metrics of DCGM Exporter, see DCGM Exporter > Default Metrics.

  • To collect metrics beyond the defaults, remove the # from the corresponding lines in default-counters.csv.
  • To stop collecting one of the default metrics, add a # in front of it or delete the line.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...
Code block. default-counters.csv setting example