How-to guides

Users can create the service by entering the required GPU Server information and selecting detailed options through the Samsung Cloud Platform Console.

Creating a GPU Server

You can create and use a GPU Server service from the Samsung Cloud Platform Console.

To create a GPU server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.

  2. On the Service Home page, click the GPU Server Creation button. You will be taken to the GPU Server Creation page.

  3. On the GPU Server Creation page, enter the information required to create the service and select detailed options.

    • Select the required information in the Image and Version Selection area.
      Category | Required status | Detailed description
      Image | Required | Select the type of Image provided
      • Standard: Image provided as standard by Samsung Cloud Platform
        • RHEL, Ubuntu
      • Custom: User-created Image
      • Kubernetes: Image for Kubernetes
        • Ubuntu
      Image version | Required | Select the version of the chosen Image
      • Provides a list of versions for the offered server Image
      Table. GPU Server image and version selection input fields
    • In the Service Information Input area, enter or select the required information.
      Category | Required status | Detailed description
      Number of servers | Required | Number of GPU Servers to create at the same time
      • Only numeric input is allowed; enter a value between 1 and 100
      Service Type > Server Type | Required | GPU Server type
      • Indicates the specifications of a GPU-type server; you can select a server with 1, 2, 4, or 8 GPUs
      Service Type > Planned Compute | Optional | Resource status with Planned Compute configured
      • In Use: Number of resources with Planned Compute that are currently in use
      • Configured: Number of resources with Planned Compute configured
      • Coverage Preview: Amount applied per resource by Planned Compute
      • Apply for Planned Compute Service: Navigate to the Planned Compute service application page
      Block Storage | Required | Configure the Block Storage used by the GPU Server according to its purpose
      • Basic: The area where the OS is installed and used
        • Capacity can be entered in units (the minimum capacity varies depending on the OS image type)
          • RHEL: Values between 3 and 1,536 can be entered
          • Ubuntu: Values between 3 and 1,536 can be entered
        • SSD: High‑performance general volume
        • HDD: General volume
        • SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys
          • Encryption can be applied only at initial creation (cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
        • SSD_Provisioned: SSD volume with configurable IOPS and throughput
      • Additional: Used when the user needs extra space beyond the OS area
        • After selecting Use, enter the storage type and capacity
        • To add storage, click the + button (up to 25 can be added); to delete, click the x button
        • Capacity can be entered in units, with values between 1 and 1,536
          • Since 1 Unit equals 8 GB, this creates 8 GB to 12,288 GB
        • SSD: High‑performance general volume
        • HDD: General volume
        • SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys
          • Encryption can be applied only at initial creation (cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
        • HDD/SSD_MultiAttach: Volume that can be attached to two or more servers
        • SSD_Provisioned: SSD volume with configurable IOPS and throughput
      • Delete on termination: If Delete on Termination is set to Enabled, the volume is terminated together with the server
        • Volumes with existing snapshots are not deleted even if Delete on termination is set to Enabled
        • A multi‑attach volume can be deleted only when the server being removed is the last remaining server attached to the volume
      Max IOPS | Required | Enter a maximum IOPS value between 5,000 and 20,000
      • Can be set only when the disk type is SSD_Provisioned
      Max Throughput | Required | Enter a maximum throughput value between 250 and 1,000
      • Can be set only when the disk type is SSD_Provisioned
      Table. GPU Server Service Configuration Items
    • In the Required Information Input area, enter or select the necessary information.
      Category | Required status | Detailed description
      Server name | Required | Enter a name to distinguish the server when the selected number of servers is 1
      • The hostname is set using the entered server name
      • Enter within 63 characters using letters, numbers, spaces, and special characters (-, _)
      Server name Prefix | Required | Enter a prefix to distinguish each server created when the selected number of servers is 2 or more
      • Names are generated automatically in the form of the user input value (prefix) + '-#'
      • Enter within 59 characters using letters, numbers, spaces, and special characters (-, _)
      Network Settings > Create New Network Port | Required | Configure the network where the GPU Server will be installed
      • Select a pre‑created VPC.
      • General Subnet: Select a pre‑created general Subnet
        • IP can be set to auto‑generate or manual entry; if manual is chosen, the user can directly input the IP
        • NAT: Available only when there is a single server and the VPC is attached to an Internet Gateway. Checking Use enables selection of a NAT IP
        • NAT IP: Select a NAT IP
          • If no NAT IP is available, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
          • Creating a Public IP incurs charges according to the Public IP pricing policy
      • Local Subnet (optional): Choose Use for a local Subnet
        • It is not a required element for creating the service
        • A pre‑created local Subnet must be selected
        • IP can be set to auto‑generate or manual entry; selecting Manual allows the user to input the IP directly
        • Security Group: Settings required to access the server
          • Select: Choose a pre‑created Security Group
          • Create New: If no applicable Security Group exists, you can create one separately in the Security Group service
          • Up to 5 can be selected
          • If no Security Group is set, all connections are blocked by default
          • You must configure a Security Group to allow required connections
      Network Settings > Existing Network Port Assignment | Required | Configure the network where the GPU Server will be installed
      • Select a pre-created VPC
      • General Subnet: Select a pre-created general Subnet and Port
        • NAT: Available only when there is a single server and the VPC is connected to an Internet Gateway. When you check to use it, you can select a NAT IP.
        • NAT IP: Select a NAT IP
          • If there is no NAT IP to select, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
      • Local Subnet (optional): Select Use for the local Subnet
        • Select a pre-created local Subnet and Port
      Keypair | Required | User authentication method used when connecting to the server
      • New creation: Create a new keypair if one is needed
      • Default login accounts by OS
        • RHEL: cloud-user
        • Ubuntu: ubuntu
      Table. Required input fields for GPU Server
    • Enter or select the required information in the Additional Information Input area.
      Category | Required status | Detailed description
      Lock | Optional | Whether to use Lock
      • When Lock is enabled, actions such as server termination, start, and stop cannot be executed, which prevents accidental operations
      Init script | Optional | Script executed when the server starts
      • The init script must be written as a Batch script for Windows, a Shell script for Linux, or cloud‑init, depending on the image type.
      • Up to 45,000 bytes can be entered
      Tag | Optional | Add tags
      • Up to 50 can be added per resource
      • After clicking the Add Tag button, enter or select Key, Value values
      Table. GPU Server additional information input fields
  4. In the Summary panel, check the detailed information and estimated billing amount, and click the Create button.

    • Once creation is complete, check the created resources on the GPU Server List page.
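
The input limits listed in the tables above can be checked before you fill in the creation form. The following is a minimal sketch, not part of the Console or any Samsung Cloud Platform API: the function names are hypothetical, the prefix numbering is assumed to start at 1, "letters" is assumed to mean ASCII letters, and the init script is assumed to be measured as UTF-8 bytes.

```python
import re

UNIT_GB = 8  # from the guide: 1 Unit equals 8 GB


def units_to_gb(units: int, minimum: int = 1, maximum: int = 1536) -> int:
    """Convert a Block Storage capacity in Units to GB after a range check."""
    if not minimum <= units <= maximum:
        raise ValueError(f"units must be between {minimum} and {maximum}")
    return units * UNIT_GB


def check_provisioned(max_iops: int, max_throughput: int) -> None:
    """Check the SSD_Provisioned ranges stated in the guide."""
    if not 5_000 <= max_iops <= 20_000:
        raise ValueError("Max IOPS must be between 5,000 and 20,000")
    if not 250 <= max_throughput <= 1_000:
        raise ValueError("Max Throughput must be between 250 and 1,000")


def server_names(prefix: str, count: int) -> list[str]:
    """Build names in the form prefix + '-#' for 2 to 100 servers.

    Assumption: '#' is a sequence number starting at 1; the Console may
    number the servers differently.
    """
    if not 2 <= count <= 100:
        raise ValueError("a prefix applies only when creating 2 to 100 servers")
    # Assumption: "letters" means ASCII letters.
    if not re.fullmatch(r"[A-Za-z0-9 _-]{1,59}", prefix):
        raise ValueError("prefix must be 1-59 letters, numbers, spaces, '-' or '_'")
    return [f"{prefix}-{i}" for i in range(1, count + 1)]


def init_script_fits(script: str, limit: int = 45_000) -> bool:
    """Return True if the script is within the byte limit (UTF-8 assumed)."""
    return len(script.encode("utf-8")) <= limit


if __name__ == "__main__":
    print(units_to_gb(1536))             # largest additional volume, in GB
    print(units_to_gb(3, minimum=3))     # smallest OS volume for RHEL or Ubuntu
    print(server_names("gpu-train", 3))
    print(init_script_fits("#!/bin/bash\necho hello\n"))
```

With a 59-character prefix, a hyphen, and a number of up to three digits, the generated name stays within the 63-character limit that applies to a single server name.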

Check GPU Server detailed information

In the GPU Server service, you can view and edit the full resource list and detailed information. The GPU Server Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Server service, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource you want to view details for. You will be taken to the GPU Server Details page.
    • The GPU Server Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
    • For detailed information about the additional features of GPU Server, refer to GPU Server Management Additional Features.
      Category | Detailed description
      GPU Server status | Status of the GPU Server created by the user
      • Build: State where the Build command has been received
      • Building: Build in progress
      • Networking: Server creation networking process
      • Scheduling: Server creation scheduling process
      • Block_Device_Mapping: Connecting Block Storage during server creation
      • Spawning: Server creation process is ongoing
      • Active: Available state
      • Powering_off: State when a stop request is made
      • Deleting: Server deletion in progress
      • Reboot_Started: Reboot in progress
      • Error: Error state
      • Migrating: Server migrating to another host
      • Reboot: State where the Reboot command has been received
      • Rebooting: Rebooting
      • Rebuild: State where the Rebuild command has been received
      • Rebuilding: State during Rebuild request
      • Rebuild_Spawning: Rebuild process is ongoing
      • Resize: State where the Resize command has been received
      • Resizing: Resizing in progress
      • Resize_Prep: State when a server type change is requested
      • Resize_Migrating: Server is migrating to another host while resizing
      • Resize_Migrated: Server has completed migration to another host during resizing
      • Resize_Finish: Resize completed
      • Revert_Resize: Resize or migration of the server failed for some reason. The target server is cleaned up and the original server is restarted
      • Shutoff: State when powering off is completed
      • Verify_Resize: After Resize_Prep following a server type change request, the server type can be confirmed or reverted
      • Resize_Reverting: State when a server type revert is requested
      • Resize_Confirming: State confirming the server’s Resize request
      Server control | Buttons to change the server status
      • Start: Start a stopped server
      • Stop: Stop a running server
      • Restart: Restart a running server
      Create Image | Create a custom user Image from the current server image
      Console log | View the console logs of the current server
      • You can view the console logs output by the current server. For more details, see Check console log.
      Create dump | Generate a dump of the current server
      • The dump file is created inside the GPU Server
      • For detailed dump creation instructions, refer to Create Dump
      Rebuild | Delete all data and settings of the existing server and configure a new server
      Service termination | Button to terminate the service
      Table. GPU Server status information and additional features
Notice
When using the MIG feature, recheck the MIG settings after the GPU Server has left the Rebooting state.
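
One way to recheck the setting after a reboot is to query the GPU driver from inside the server. The sketch below assumes that the feature refers to NVIDIA MIG and that the NVIDIA driver with the nvidia-smi tool is installed on the server image; neither is stated in this guide. The helper names are illustrative.

```python
import subprocess


def parse_mig_modes(csv_text: str) -> dict[int, str]:
    """Parse 'index, mig.mode.current' CSV rows (no header) into {gpu_index: mode}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = (field.strip() for field in line.split(","))
        modes[int(index)] = mode
    return modes


def current_mig_modes() -> dict[int, str]:
    """Ask nvidia-smi for the current MIG mode of every GPU on this server."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,mig.mode.current", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    # Run this on the GPU Server itself once it is Active again.
    for gpu, mode in current_mig_modes().items():
        print(f"GPU {gpu}: MIG mode {mode}")
```

If a GPU reports a mode other than the one you configured before the reboot, reapply your MIG configuration before starting workloads.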

Detailed Information

On the GPU Server List page, you can view detailed information about the selected resource and edit it if needed.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In the GPU Server service, it refers to the GPU Server SRN
Resource name | Resource name
  • In the GPU Server service, it refers to the GPU Server name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Editor | User who edited the service information
Modification date and time | Date and time the service information was modified
Server name | Server name
Server type | Displays vCPU, memory, and GPU information
  • To change to a different server type, click the Edit button
Image name | Service OS image and version
Lock | Whether Lock is in use
  • To change the Lock setting, click the Edit button
Keypair name | Server authentication information set by the user
Planned Compute | Resource status with Planned Compute configured
LLM Endpoint | URL for using the LLM
ServiceWatch Detailed Monitoring | When enabled, detailed monitoring data is available in the ServiceWatch service
  • To enable or disable it, click the Edit button
Network | Network information of the GPU Server
  • VPC name, general Subnet name, IP, NAT IP, NAT IP status, Security Group name
  • To change the NAT IP, click the Edit button
  • To change the Security Group, click the Edit button
  • Add as new network: Select a general Subnet and IP
    • You can select a different general Subnet within the same VPC
    • The IP can be auto-generated or entered manually; if you choose manual entry, you can input the IP directly
  • Add using existing port: Select a pre-created general Subnet and port
Local Subnet | Local Subnet information of the GPU Server
  • Local Subnet name, local Subnet IP, Security Group name
  • To change the Security Group, click the Edit button
  • Add as new network: Select a local Subnet and IP
    • You can select a different local Subnet within the same VPC
    • The IP can be auto-generated or entered manually; if you choose manual entry, you can input the IP directly
  • Add using existing port: Select a pre-created local Subnet and port
Block Storage | Information about the Block Storage attached to the server
  • Volume ID, Volume Name, Type, Capacity, Connection Info, Category, Delete on termination, Status
  • Add: Attach additional Block Storage if needed
  • Edit Delete on termination: Modify the Delete on termination value
  • More > Detach: Detach the selected Block Storage from the list
Caution
When using ServiceWatch detailed monitoring, additional fees apply.

Tags

On the GPU Server List page, you can view the tag information of the selected resource and add, modify, or delete tags.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. GPU Server Tag Tab Items

Job History

You can view the job history of the selected resource on the GPU Server List page.

Category | Detailed description
Job history list | Resource change history
  • Operation Time, Resource ID, Resource Name, Operation Details, Event Topic, Operation Result, Operator Information
Table. GPU Server Job History tab items

Control GPU Server Operation

If you need to control the operation of a created GPU Server resource, you can perform the task on the GPU Server List or GPU Server Details page. You can start a stopped server, and stop or restart a running server.

Starting the GPU Server

You can start a stopped (Shutoff) GPU Server. To start a GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of the GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource you want to start among the stopped (Shutoff) servers. You will be taken to the GPU Server Details page.
    • On the GPU Server List page, you can Start each resource via the More button on the right.
    • After selecting multiple servers with checkboxes, you can control multiple servers simultaneously using the Start button at the top.
  4. On the GPU Server Details page, click the Start button at the top to start the server. In the Status Display section, verify the updated server status.
    • When the GPU Server startup is complete, the server status changes from Shutoff to Active.
    • For detailed information about the GPU Server status, refer to GPU Server detailed information.
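
If you track the status outside the Console, the change from Shutoff to Active can be expressed as a simple polling loop. This is a minimal sketch: `get_status` is a hypothetical callable that returns the current status string, since this guide does not describe an API for reading it, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "Active",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns `target`, reports Error, or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status == "Error":
            raise RuntimeError(f"server entered Error while waiting for {target}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server is still {status} after {timeout_s} seconds")
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    simulated = iter(["Shutoff", "Shutoff", "Active"])
    print(wait_for_status(lambda: next(simulated), sleep=lambda _: None))
```

The same loop works for the stop operation by passing `target="Shutoff"`.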

Stopping the GPU Server

You can stop a running (Active) GPU Server. To stop the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to stop among the servers that are running (Active), and navigate to the GPU Server Details page.
    • On the GPU Server List page, you can Stop each resource via the right More button.
    • After selecting multiple servers with checkboxes, you can control multiple servers simultaneously using the Stop button at the top.
  4. On the GPU Server Details page, click the Stop button at the top to stop the server. Check the changed server status in the Status Display item.
    • When the GPU server shutdown is complete, the server status changes from Active to Shutoff.
    • For detailed information about the GPU Server status, please refer to GPU Server Detailed Information.

Restart GPU Server

You can restart the created GPU Server. To restart the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to restart and go to the GPU Server Details page.
    • On the GPU Server List page, you can restart each resource using the More button on the right.
    • After selecting multiple servers with checkboxes, you can control multiple servers simultaneously using the Restart button at the top.
  4. On the GPU Server Details page, click the Restart button at the top to restart the server. Check the updated server status in the Status Display field.
    • During a GPU Server restart, the server status goes through Rebooting and finally changes to Active.
    • For detailed information about the GPU Server status, refer to GPU Server detailed information.

GPU Server resource management

If you need server control and management functions for the created GPU Server resources, you can perform the tasks on the GPU Server List or GPU Server Details page.

Create Image

You can create an image of a running GPU server.

Reference

This guide explains how to create a custom user image from the image of a running GPU server.

  • On the GPU Server List or GPU Server Details page, click the Create Image button to create a user Custom Image.

To create an image of the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the Service Home page of the GPU Server.

  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.

  3. On the GPU Server List page, click the resource to create an Image. You will be taken to the GPU Server Details page.

  4. Click the Create Image button on the GPU Server Details page. You will be taken to the Image Creation page.

    • Enter the required information in the Service Information Input area.
      Category | Required status | Detailed description
      Image name | Required | Enter the name of the Image to be created
      • Enter within 200 characters using English letters, numbers, spaces, and special characters (-, _)
      Table. Image service information input fields
  5. Check the input information and click the Create button.

    • When creation is complete, check the created resources on the All Services > Compute > GPU Server > Image List page.
Notice
  • When you create an Image, the generated Image is stored in the Object Storage used as internal storage. Therefore, Object Storage usage fees are applied.
  • Since the file system of an image created from an active GPU server cannot be guaranteed to be intact, it is recommended to stop the server before creating the image.

Enabling detailed monitoring for ServiceWatch

By default, GPU Server is linked to basic monitoring in ServiceWatch under the Virtual Server namespace. You can enable detailed monitoring as needed to identify operational issues more quickly and take action. For detailed information about ServiceWatch, see ServiceWatch Overview.

Note
The GPU Server provides basic and detailed monitoring in the same namespace as the Virtual Server. GPU Server’s GPU metrics will be provided by the ServiceWatch Agent. (Planned for December 2025)
Caution
Basic monitoring is provided free of charge, but enabling detailed monitoring incurs additional fees. Please be aware when using it.

To enable detailed ServiceWatch monitoring on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to enable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup window, select Enable, review the instructions, and click the Confirm button.
  6. On the GPU Server Details page, check the ServiceWatch detailed monitoring items.

Disable detailed monitoring of ServiceWatch

Caution
For cost efficiency, disable detailed monitoring when it is not needed. Keep it enabled only for the resources that really require it.

To disable detailed monitoring of ServiceWatch on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource for which to disable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup window, deselect Enable, review the guidance message, and click the Confirm button.
  6. On the GPU Server Details page, view the ServiceWatch detailed monitoring items.

GPU Server Management Additional Features

For server management, you can view console logs, create dumps, and perform Rebuilds on a GPU Server. To do so, follow the steps below.

Check console log

You can view the current console log of the GPU Server.

To view the console logs of the GPU server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the GPU Server’s Service Home page.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to view the console log. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Console Log button. You will be taken to the Console Log popup.
  5. Check the console log displayed in the Console Log popup window.

Create Dump

To create a dump file on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource for which to create a dump. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Create Dump button.
    • The dump file is created inside the GPU server.

Perform Rebuild

You can delete all data and settings of the existing GPU Server and rebuild it on a new server.

Follow these steps to perform a Rebuild of the GPU Server.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource on which to perform the Rebuild. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Rebuild button.
    • During a GPU Server Rebuild, the server status changes to Rebuilding, and when the Rebuild is complete, it returns to its state before the Rebuild.
    • For detailed information about the GPU Server status, refer to Check GPU Server details.

Terminate GPU Server

If you terminate an unused GPU Server, you can reduce operating costs. However, terminating a GPU Server may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with the termination.

Caution
Please note that data cannot be recovered after terminating the service.

To terminate the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, select the resource to terminate and click the Cancel Service button.
    • The termination of attached storage depends on the Delete on termination setting, so refer to Termination constraints.
  4. After termination is complete, check on the GPU Server List page whether the resources have been terminated.

Termination constraints

When a GPU Server termination request cannot be processed, a popup provides guidance. Refer to the cases below.

Termination not possible
  • If File Storage is connected, disconnect the File Storage first.
  • If an LB Pool is connected, disconnect the LB Pool first.
  • If Lock is set, disable the Lock setting and try again.

Termination of attached storage depends on the Delete on termination setting.

Deletion according to the Delete on termination setting
  • Whether a volume is deleted depends on its Delete on termination setting.
    • When Delete on termination is not set: Even if you terminate the GPU Server, the volume will not be deleted.
    • When Delete on termination is set: If you terminate the GPU Server, the associated volume will be deleted.
  • Volumes that have snapshots will not be deleted even if Delete on termination is set.
  • A Multi‑attach volume can be deleted only when the server being deleted is the last remaining server attached to the volume.
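
The rules above can be read as two small decision functions: one that lists what blocks a termination request, and one that tells whether a given volume is deleted with the server. The sketch below only models the rules as written in this section; the class and function names are hypothetical and do not correspond to a Samsung Cloud Platform API.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    delete_on_termination: bool
    has_snapshot: bool = False
    # Number of servers attached to the volume besides the one being terminated
    # (can be greater than 0 only for multi-attach volumes).
    other_attached_servers: int = 0


def termination_blockers(
    file_storage_connected: bool,
    lb_pool_connected: bool,
    lock_enabled: bool,
) -> list[str]:
    """Return the reasons a termination request would be rejected (empty if none)."""
    blockers = []
    if file_storage_connected:
        blockers.append("Disconnect the File Storage first")
    if lb_pool_connected:
        blockers.append("Disconnect the LB Pool first")
    if lock_enabled:
        blockers.append("Disable the Lock setting")
    return blockers


def volume_deleted_on_termination(volume: Volume) -> bool:
    """Apply the Delete on termination rules to one attached volume."""
    if not volume.delete_on_termination:
        return False  # the volume is kept when the option is not set
    if volume.has_snapshot:
        return False  # volumes with snapshots are never deleted
    # A multi-attach volume is deleted only with its last attached server.
    return volume.other_attached_servers == 0


if __name__ == "__main__":
    print(termination_blockers(False, True, True))
    print(volume_deleted_on_termination(Volume(delete_on_termination=True)))
```

Checking the blockers first mirrors the Console behavior described above, where the request is rejected with a popup before any volume is touched.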

1 - Manage Image

Users can create the service by entering the required information for the Image service within the GPU Server service and selecting detailed options through the Samsung Cloud Platform Console.

Create Image

You can create an image of a running GPU Server. To create an image of a GPU Server, refer to Create Image.

Check Image detailed information

In the Image service, you can view and edit the full resource list and detailed information. The Image Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the Image service, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, click the resource to view detailed information. You will be taken to the Image Details page.
    • The Image Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
      Category | Detailed description
      Image status | Status of the Image created by the user
      • Active: Available state
      • Queued: Image has been uploaded and is waiting for processing after creation
      • Importing: Image has been uploaded and is currently being processed after creation
      Share to another Account | Share the Image with another Account
      • The Image’s Visibility must be set to Shared in order to be shared with another Account
      Delete image | Button to delete the Image
      • Once the Image is deleted, it cannot be restored
      Table. GPU Server Image status information and additional features

Detailed Information

On the Image List page, you can view detailed information about the selected resource and modify it if necessary.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • Refers to the SRN of a GPU Server Image
Resource name | Image name
Resource ID | Image ID
Creator | User who created the Image
Creation date and time | Date and time the Image was created
Editor | User who edited the Image
Modification date and time | Date and time the Image was modified
Image name | Image name
Minimum disk | Minimum disk capacity (GB) of the Image
  • To change the minimum disk, click the Edit button
Minimum RAM | Minimum RAM size (GB) of the Image
OS type | OS type of the Image
OS hash algorithm | OS hash algorithm method
Visibility | Access permission of the Image
  • Private Images can be used only within the project; Shared Images can be shared across projects
Protected | Whether deletion of the Image is prohibited
  • Checking Use prevents accidental deletion of the Image
  • This setting can be changed after the Image is created
Image file URL | URL of the Image file uploaded when the Image was created
  • Not displayed for Images created through the Create Image menu on the GPU Server Details page
Sharing status | Current status of sharing the Image with other Accounts
  • Approved Account ID: ID of the Account for which sharing is approved
  • Modification date and time: Date and time when sharing was requested to the other Account; updated when the sharing status changes from Pending to Accepted
  • Status: Approval status
    • Accepted: Sharing is approved and active
    • Pending: Awaiting approval
  • Stop sharing: Sharing has been stopped
  • Select the Account IDs to stop sharing with from the list, then click the Stop sharing button at the top of the list to stop sharing with all of them at once
Table. Image detailed information tab items

Tags

On the Image List page, you can view the tag information of the selected resource and add, modify, or delete it.

Category | Detailed description
Tag list | List of tags
  • You can view the Key and Value information of each tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. Image tag tab items

Job History

You can view the operation history of the selected resource on the Image List page.

Category | Detailed description
Job history list | Resource change history
  • You can check the operation time, resource ID, resource name, operation details, event topic, operation result, and operator information
Table. GPU Server Image Job History tab detailed information items

Image Resource Management

This section describes the control and management functions for a created Image.

Share to another Account

To share an Image with another Account, follow the steps below.

  1. Log in to the Account that will share the Image and click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, click the Image you want to control. You will be taken to the Image Details page.
  4. Click the Share to another Account button. You will be taken to the Share image to another Account page.
    • The Share to another Account feature lets you share an Image with another Account. To be shared, the Image’s Visibility must be Shared.
  5. On the Share image to another Account page, enter the required information and click the Done button.
    Category | Required | Detailed description
    Image name | - | Name of the Image to share
    • Input not allowed
    Image ID | - | ID of the Image to share
    • Input not allowed
    Shared Account ID | Required | Enter the ID of the Account to share with
    • Use English letters, numbers, and the special character (-), up to 64 characters
    Table. Required input fields for sharing images to another Account
  6. Check the result in the Sharing status item on the Image Details page.
    • When the request is first made, the status is Pending, and it changes to Accepted once approval is completed by the account receiving the share.
Notice
Only Images that you created by uploading an image file can be shared with another Account. A Custom Image created from the Image of a running GPU Server cannot currently be shared with another Account; this feature will be provided in the future.
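
The Shared Account ID rule in the table above (English letters, numbers, and the hyphen, up to 64 characters) can be pre-checked locally before you fill in the form. The following is a minimal sketch; the function name is_valid_account_id and the sample value are hypothetical, and the Console remains the authority on which values it accepts.

```shell
#!/bin/sh
# Hypothetical helper: check a Shared Account ID against the documented rule
# (English letters, numbers, and "-", between 1 and 64 characters).
is_valid_account_id() {
    # An empty value fails because the field is Required.
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]{1,64}$'
}

# Example with a made-up Account ID.
if is_valid_account_id "example-account-01"; then
    echo "valid"
else
    echo "invalid"
fi
```

The check only covers the character set and length; it does not confirm that the Account exists.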

Receive sharing from another Account

To receive an Image shared from another Account, follow these steps.

  1. Log in to the Account that will receive the shared Image and click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, click the Receive Image Share button. The Receive Image Share popup opens.
  4. In the Receive Image Share popup, enter the resource ID of the Image you want to receive, and click the Confirm button.
  5. When image sharing is complete, you can view the shared Image in the Image list.

Delete Image

You can delete Images that are no longer in use. A deleted Image cannot be recovered, so review the impact carefully before deleting it.

Caution
Data cannot be recovered after the Image is deleted.

To delete the Image, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, select the resource to delete and click the Delete button.
    • To delete multiple Images at once, select their check boxes on the Image List page and click the Delete button at the top of the resource list.
  4. After deletion is complete, verify on the Image list page that the resource has been removed.

2 - Manage Keypair

You can create a Keypair for the GPU Server service by entering the required information and selecting detailed options in the Samsung Cloud Platform Console.

Create a Keypair

You can create and use the Keypair service while using the GPU Server service in the Samsung Cloud Platform Console.

To create a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, click the Keypair Create button. You will be taken to the Keypair Create page.
    • Enter the required information in the Service Information Input area.
      Category | Required | Detailed description
      Keypair name | Required | Enter the name of the Keypair to create
      • Use English letters, numbers, spaces, and special characters (-, _), up to 255 characters
      Keypair type | Required | ssh
      Table. Keypair service information input fields
    • In the Additional Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Tag | Optional | Add tags
      • Up to 50 tags can be added per resource
      • Click the Add Tag button, then enter or select the Key and Value
      Table. Keypair additional information input fields
      Caution
      • The Key can be downloaded only once, right after creation. It cannot be reissued, so make sure the download has completed.
      • Store the downloaded Private Key in a safe place.
  4. Check the input information and click the Create button.
    • After creation is complete, check the created resources on the Keypair List page.
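
Because the Private Key can be downloaded only once, it is worth restricting access to the file as soon as it is saved. The following is a minimal sketch, assuming a Linux or macOS workstation; the function name protect_private_key and the paths in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: make a downloaded private key readable and writable
# by its owner only. SSH clients typically refuse keys that other users can read.
protect_private_key() {
    key_file=$1
    [ -f "$key_file" ] || { echo "not found: $key_file" >&2; return 1; }
    chmod 600 "$key_file"
}

# Example usage (file name, user name, and address are placeholders):
#   protect_private_key "$HOME/Downloads/my-keypair.pem"
#   ssh -i "$HOME/Downloads/my-keypair.pem" <user>@<server-ip>
```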

View detailed information of the Keypair

With the Keypair service, you can view the full resource list and the detailed information of each resource, and edit it. The Keypair Details page consists of the Details, Tags, and Activity Log tabs.

To view detailed information about a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, click the resource to view its details. You will be taken to the Keypair Details page.
    • The Keypair Details page displays status information and additional features, and consists of the Details, Tags, and Activity Log tabs.

Detailed Information

On the Keypair List page, you can view detailed information about the selected resource and edit it if needed.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • For a Keypair, this is the Keypair SRN
Resource name | Keypair name
Resource ID | Unique resource ID of the Keypair
Creator | User who created the Keypair
Creation date and time | Date and time the Keypair was created
Editor | User who modified the Keypair information
Modification date and time | Date and time the Keypair information was modified
Keypair name | Keypair name
Fingerprint | Unique value that identifies the key
User ID | User ID of the Keypair creator
Public key | Public key information
Table. Keypair detailed information tab items

Tags

On the Keypair List page, you can view the tag information of the selected resource, and add, modify, or delete it.

Category | Detailed description
Tag list | List of tags
  • You can view the Key and Value information of each tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. Keypair tag tab items

Job History

On the Keypair List page, you can view the operation history of the selected resource.

Category | Detailed description
Job history list | Resource change history
  • You can check the operation time, resource ID, resource name, operation details, event topic, operation result, and operator information
Table. Keypair Job History tab detailed information items

Keypair Resource Management

This section describes the control and management functions for a Keypair.

Import public key

To import a public key, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.

  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.

  3. On the Keypair List page, click the More button at the top and then click the Import Public Key button. You will be taken to the Import Public Key page.

    • Enter or select the required information in the Required Information Input area.
      Category | Required | Detailed description
      Keypair name | Required | Name of the Keypair to create
      Keypair type | Required | ssh
      Public key | Required | Enter the public key
      • Load file: Click the Attach file button to attach the public key file
        • Only files with the .pem extension can be attached
      • Enter public key: Paste the copied public key value
        • The public key value can be copied from the Keypair Details page
      Table. Required input fields for importing a public key
  4. Review the entered information and click the Complete button.

    • Once creation is complete, check the created resources on the Keypair List page.
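
If you hold only a private key file, the value for the Enter public key field can be derived from it locally. The following is a minimal sketch that assumes OpenSSH's ssh-keygen is installed; the function name print_public_key and the path in the usage comment are hypothetical, and whether the Console accepts the OpenSSH public key format is an assumption you should verify.

```shell
#!/bin/sh
# Hypothetical helper: print the public key that matches a private key file.
print_public_key() {
    private_key=$1
    [ -f "$private_key" ] || { echo "not found: $private_key" >&2; return 1; }
    # -y reads a private key file and writes the matching public key to stdout.
    # ssh-keygen prompts for the passphrase if the key is protected by one.
    ssh-keygen -y -f "$private_key"
}

# Example usage (hypothetical path):
#   print_public_key "$HOME/.ssh/id_rsa"
```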

Delete Keypair

You can delete Keypairs that are no longer in use. A deleted Keypair cannot be recovered, so review the impact thoroughly before deleting it.

Caution
Data cannot be recovered after the Keypair is deleted.

To delete a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Go to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, select the resource to delete, and click the Delete button.
    • To delete multiple Keypairs at once, select their check boxes on the Keypair List page and click the Delete button at the top of the resource list.
  4. After deletion is complete, check the Keypair List page to confirm that the resource has been removed.

3 - Use Multi-instance GPU on GPU Server

After creating a GPU Server, you can enable the MIG (Multi-instance GPU) feature on the GPU Server’s VM (Guest OS) and create an instance for use.

NVIDIA Multi-instance GPU Introduction

NVIDIA Multi-instance GPU (hereafter referred to as MIG) is supported starting with the NVIDIA Ampere architecture. It securely partitions a GPU into separate GPU Instances on which CUDA applications run. Multiple users can each use their own GPU resources, which improves overall GPU utilization. MIG is especially useful for workloads that do not fully use the GPU’s computing capacity, because several such workloads can run in parallel.

Using Multi-instance GPU feature

To use the MIG feature, create an NVIDIA GPU Server on the Samsung Cloud Platform and enable MIG on it. Disable MIG when you no longer need it. Apply and release MIG in the following order.

MIG application order
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG
MIG release order
Delete Compute Instance → Delete GPU Instance → Disable MIG
Reference
  • MIG can be used on Samsung Cloud Platform’s next-generation GPU Server or MNGC (Multi-node GPU Cluster).
  • For system requirements to use MIG, refer to the NVIDIA Multi-Instance GPU User Guide.
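
The apply and release orders above map onto nvidia-smi commands roughly as follows. This is a dry-run sketch that only prints the commands; the function names are hypothetical, and the GPU ID, GPU Instance ID, and profile IDs are illustrative defaults that you should replace with the values reported on your own server.

```shell
#!/bin/sh
# Print (do not execute) the commands for the MIG apply order.
# Arguments: GPU ID, GPU Instance profile ID, GPU Instance ID, Compute Instance profile ID.
mig_apply_commands() {
    gpu=${1:-0}; gi_profile=${2:-0}; gi=${3:-0}; ci_profile=${4:-4}
    echo "nvidia-smi -i $gpu -mig 1"
    echo "# reboot the VM Instance, then continue"
    echo "nvidia-smi mig -i $gpu -cgi $gi_profile"
    echo "nvidia-smi mig -i $gpu -gi $gi -cci $ci_profile"
}

# Print (do not execute) the commands for the MIG release order.
# Arguments: GPU ID, GPU Instance ID.
mig_release_commands() {
    gpu=${1:-0}; gi=${2:-0}
    echo "nvidia-smi mig -i $gpu -gi $gi -dci"
    echo "nvidia-smi mig -i $gpu -gi $gi -dgi"
    echo "nvidia-smi -i $gpu -mig 0"
}

mig_apply_commands
mig_release_commands
```

Printing the commands first makes it easy to review the order before running anything on the GPU Server.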

Applying and Using MIG

To apply MIG, enable it, create the Instances, and then assign workloads to them, in the following order.

MIG application order
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG
Note
The example of applying MIG is explained based on an A100 GPU server.
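
For the final Use MIG step, a CUDA application is usually pointed at a specific MIG device through its UUID. The following is a minimal sketch that extracts MIG UUIDs from nvidia-smi -L output; the function name list_mig_uuids and the application name in the usage comment are hypothetical, and the UUID format assumes a driver that reports MIG devices as MIG-<uuid>, as in the A100 example in this guide.

```shell
#!/bin/sh
# Read `nvidia-smi -L` output on stdin and print one MIG device UUID per line.
list_mig_uuids() {
    sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Example usage on a server where the Instances already exist
# (./my_cuda_app is a placeholder for your own program):
#   CUDA_VISIBLE_DEVICES=$(nvidia-smi -L | list_mig_uuids | head -n 1) ./my_cuda_app
```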

Activate MIG

  1. Check the GPU status on the VM Instance (GuestOS) before applying MIG.

    • Check whether MIG mode is Disabled.
      $ nvidia-smi
      Mon Sep 27 08:37:08 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
      | N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU disabled status (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU disabled status (2)
  2. Enable MIG for each GPU on the VM Instance (GuestOS), then reboot the VM Instance.

    $ nvidia-smi -i 0 -mig 1
    Enabled MIG mode for GPU 00000000:05:00.0
    All done.

    # reboot
    Code block. nvidia-smi command - enable MIG

Reference

If you configure MIG while the GPU is in use, the following warning message may appear. In that case, check whether any programs are running on the GPU.

Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).
  3. Check the GPU status on the VM Instance (GuestOS) after applying MIG.
    • Check whether MIG mode is Enabled.
      $ nvidia-smi
      Mon Sep 27 09:44:33 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
      |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                  |                      |        ECC|                       |
      |==================+======================+===========+=======================|
      |  No MIG devices found                                                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU activation status (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU activation status (2)
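
On a server with several GPUs, reading the MIG M. field by eye becomes tedious. The following is a minimal sketch that lists the GPUs whose MIG mode is still Disabled, using the CSV query interface of nvidia-smi; the function name gpus_with_mig_disabled is hypothetical, and the sketch assumes the driver exposes the mig.mode.current query field.

```shell
#!/bin/sh
# Read CSV lines of the form "index, mig.mode.current" on stdin and print the
# index of every GPU whose MIG mode is Disabled.
gpus_with_mig_disabled() {
    awk -F', *' '$2 == "Disabled" { print $1 }'
}

# Guarded so that the sketch does nothing on hosts without an NVIDIA driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader |
        gpus_with_mig_disabled |
        while read -r idx; do
            echo "MIG is disabled on GPU $idx; to enable it: nvidia-smi -i $idx -mig 1"
        done
fi
```

The script only reports the state and prints the command to run; it does not change the MIG mode itself.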

GPU Instance creation

If you have enabled MIG and verified its status, you can create a GPU Instance.

  1. Check the list of MIG GPU Instance profiles that can be created.
    $ nvidia-smi mig -i [GPU ID] -lgip
    Code block. nvidia-smi command - view MIG GPU Instance profile list
$ nvidia-smi mig -i 0 -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0 MIG 1g.10gb        19    7/7         9.50       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0 MIG 1g.10gb+me     20    1/1         9.50       No     14     0     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0 MIG 2g.20gb        14    3/3         19.50      No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0 MIG 3g.40gb         9    2/2         39.50      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0 MIG 4g.40gb         5    1/1         39.50      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0 MIG 7g.80gb         0    1/1         79.25      No     98     0     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+
Code block. MIG GPU Instance profile list
Reference
Refer to the NVIDIA Multi-Instance GPU User Guide for GPU Instance profiles.
  2. Create a MIG GPU Instance, then check it.
    • Create GPU Instance

      $ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]
      Code block. nvidia-smi command - GPU Instance creation
      $ nvidia-smi mig -i 0 -cgi 0
      Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)
      Code block. nvidia-smi command - Example of creating a GPU Instance

    • Check GPU Instance

      $ nvidia-smi mig -i [GPU ID] -lgi
      Code block. nvidia-smi command - check GPU Instance
      $ nvidia-smi mig -i 0 -lgi
      +--------------------------------------------------------+
      | GPU instances:                                         |
      | GPU   Name               Profile  Instance  Placement  |
      |                            ID       ID      Start:Size |
      |========================================================|
      |   0  MIG 7g.80gb            0        0         0:8     |
      +--------------------------------------------------------+
      Code block. nvidia-smi command - example of checking GPU Instance
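
The Profile ID passed to -cgi comes from the ID column of the profile list. The following is a minimal sketch that looks up the ID for a profile name in the -lgip output instead of copying it by hand; the function name gi_profile_id is hypothetical, and the parsing assumes the column layout shown in this guide, which can differ between driver versions.

```shell
#!/bin/sh
# Read `nvidia-smi mig -i <GPU ID> -lgip` output on stdin and print the
# profile ID for the given profile name (for example 3g.40gb).
gi_profile_id() {
    # Data rows look like: | <GPU> MIG <name> <ID> <Free/Total> ...
    awk -v name="$1" '$2 ~ /^[0-9]+$/ && $3 == "MIG" && $4 == name { print $5; exit }'
}

# Example usage on a GPU Server (GPU ID 0 and the profile name are illustrative):
#   id=$(nvidia-smi mig -i 0 -lgip | gi_profile_id 3g.40gb)
#   [ -n "$id" ] && nvidia-smi mig -i 0 -cgi "$id"
```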

Compute Instance creation

If you have created a GPU Instance, you can create a Compute Instance.
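
In the Compute Instance profile list shown in the steps below, one Profile ID carries an asterisk (4* in the A100 example), which nvidia-smi uses to mark the default profile. The following is a minimal sketch that extracts that default ID from the -lcip output; the function name default_ci_profile_id is hypothetical, and the parsing assumes the column layout shown in this guide.

```shell
#!/bin/sh
# Read `nvidia-smi mig -i <GPU ID> -gi <GPU Instance ID> -lcip` output on stdin
# and print the profile ID that is marked with "*".
default_ci_profile_id() {
    # Data rows look like: | <GPU> <GI> MIG <name> <ID>[*] <Free/Total> ...
    awk '$4 == "MIG" && $6 ~ /\*$/ { sub(/\*$/, "", $6); print $6; exit }'
}

# Example usage on a GPU Server (GPU ID 0 and GPU Instance ID 0 are illustrative):
#   id=$(nvidia-smi mig -i 0 -gi 0 -lcip | default_ci_profile_id)
#   [ -n "$id" ] && nvidia-smi mig -i 0 -gi 0 -cci "$id"
```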

  1. Check the MIG Compute Instance profiles you can create.

    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
    Code block. nvidia-smi command - Check MIG Compute Instance profile
    $ nvidia-smi mig -i 0 -gi 0 -lcip
    +---------------------------------------------------------------------------------+
    | Compute instance profiles:                                                      |
    | GPU     GPU     Name            Profile  Instances   Exclusive      Shared      |
    | GPU   Instance                     ID    Free/Total     SM       DEC  ENC  OFA  |
    |         ID                                                       CE   JPEG      |
    |=================================================================================|
    |   0      0      MIG 1c.7g.80gb     0      7/7           14       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 2c.7g.80gb     1      3/3           28       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 3c.7g.80gb     2      2/2           42       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 4c.7g.80gb     3      1/1           56       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 7g.80gb        4*     1/1           98       5    0    1    |
    |                                                                  7    1         |
    +---------------------------------------------------------------------------------+
    Code block. Example of MIG Compute Instance profile list

  2. Create and verify a MIG Compute Instance.

    • MIG Compute Instance creation
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]
      Code block. nvidia-smi command - Create MIG Compute Instance
      $ nvidia-smi mig -i 0 -gi 0 -cci 4
      Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb (ID 4)
      Code block. nvidia-smi command - Example of creating a MIG Compute Instance
    • Check MIG Compute Instance
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci
      Code block. nvidia-smi command - check MIG Compute Instance
      $ nvidia-smi mig -i 0 -gi 0 -lci
      +-----------------------------------------------------------------+
      | Compute instances:                                              |
      | GPU     GPU     Name            Profile  Instances   Placement  |
      | GPU   Instance                     ID      ID        Start:Size |
      |         ID                                                      |
      |=================================================================|
      |   0      0      MIG 7g.80gb         4       0            0:7    |
      +-----------------------------------------------------------------+
      Code block. Example of checking MIG Compute Instance
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
        MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
      Code block. nvidia-smi command - Check GPU status (1)
      $ nvidia-smi
      Mon Sep 27 09:52:17 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C    P0    49W / 400W |      0MiB / 81251MiB |     N/A      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
      |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                  |                      |        ECC|                       |
      |==================+======================+===========+=======================|
      |  0    0   0   0  |      0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                  |      1MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU status (2)
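The MIG device UUID printed by `nvidia-smi -L` is often needed by later tooling. The following is a minimal sketch, assuming the line layout shown in the example output above; the function name `list_mig_uuids` is hypothetical and not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Print the UUID of every MIG device found in `nvidia-smi -L` output (read from stdin).
list_mig_uuids() {
  # MIG device lines look like: "  MIG 7g.80gb     Device  0: (UUID: MIG-...)"
  sed -n 's/^[[:space:]]*MIG .*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Sample input copied from the example output above.
sample_output='GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
  MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)'

printf '%s\n' "$sample_output" | list_mig_uuids
```

On a GPU Server, the same function can read live output with `nvidia-smi -L | list_mig_uuids`.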

Using MIG

  1. Use the MIG Instance to perform the Job.
    • Example of task execution
      $ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi
      Code block. Task execution example
    • The following is an example of running a job.
      $ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh
      
      ================
      == TensorFlow ==
      ================
      
      NVIDIA Release 21.08-tf1 (build 26012104)
      TensorFlow Version 1.15.5
      
      Container image Copyright (c) 2021, NVIDIA CORPORATION. All right reserved.
      ...
      
      # Run Python process
      root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100
      ...
      PY 3.8.10 (default, Jun 2 2021, 10:49:15)
      [GCC 9.4.0]
      TF 1.15.5
      ...
      Code block. Operation result
  2. Check GPU usage while the job process is running.
    • When the job runs, you can see that a process is allocated to the MIG device and its utilization increases.
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
      Code block. nvidia-smi command - Check GPU utilization
    • You can check the GPU usage as shown below.
      +-----------------------------------------------------------------------------+
      | MIG devices: |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG | Memory-Usage | Vol | Shared |
      | ID  ID  Dev | BAR1-Usage | SM     Unc | CE  ENC  DEC  OFA  JPG |
      |  |  | ECC |  |
      | ============================================================================= |
      | 0    0   0    0 | 66562MiB / 81251MiB | 98      0 | 7   0    5    1    1 |
      |  | 5MiB / 13107... |  |  |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes: |
      | GPU   GI   CI       PID   Type   Process name                   GPU Memory |
      | ID   ID                                                   Usage |
      | ============================================================================= |
      | 0     0    0     17483      C   python                           66559MiB |
      +-----------------------------------------------------------------------------+
      Code block. Example of checking GPU utilization.
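To confirm from a script that the job process is attached to the MIG device, the Processes table can be reduced to PID and memory pairs. This is a sketch only, assuming the row layout shown in the example above (GPU, GI, CI, PID, Type, Process name, GPU Memory); the function name `list_gpu_processes` is hypothetical.

```shell
#!/usr/bin/env bash
# Print "PID MEMORY" for each process row in nvidia-smi's Processes table (read from stdin).
list_gpu_processes() {
  # Process rows have the shape: | GPU GI CI PID Type Name ... Memory |
  # The memory value is always the field just before the closing "|".
  awk '$1 == "|" && $5 ~ /^[0-9]+$/ && $6 ~ /^(C|G|C\+G)$/ { print $5, $(NF - 1) }'
}

# Sample row copied from the example output above.
sample_row='| 0     0    0     17483      C   python                           66559MiB |'

printf '%s\n' "$sample_row" | list_gpu_processes
```

Header rows and the MIG devices table are skipped because their fifth and sixth fields do not look like a PID followed by a process type.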

Delete and release MIG Instance

Follow these steps to delete the MIG instances and disable MIG.

MIG release order
Delete Compute Instance → Delete GPU Instance → Disable MIG feature (deactivate)
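The release order can be written down as a small dry-run helper that only prints the commands, so the sequence can be reviewed before anything is deleted. This is a sketch under the assumption that a single GPU Instance is being released; the function name `mig_release_plan` is hypothetical, and the `-i` option on the last command limits the change to the given GPU.

```shell
#!/usr/bin/env bash
# Print the MIG release commands in the required order for one GPU Instance.
# Nothing is executed; review the output before running the commands by hand.
mig_release_plan() {
  local gpu_id="$1" gi_id="$2"
  echo "nvidia-smi mig -i ${gpu_id} -gi ${gi_id} -dci"   # 1. delete Compute Instances
  echo "nvidia-smi mig -i ${gpu_id} -gi ${gi_id} -dgi"   # 2. delete the GPU Instance
  echo "nvidia-smi -i ${gpu_id} -mig 0"                  # 3. disable MIG mode
}

mig_release_plan 0 0
```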

Delete Compute Instance

  • Delete the Compute Instance.
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance] -dci
    Code block. nvidia-smi command - Delete Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -lci
    +-----------------------------------------------------------------+
    | Compute instance profiles: |
    | GPU     GPU     Name            Profile  Instances   Placement |
    | GPU   Instance                     ID      ID        Start:Size |
    | ID |
    | ================================================================= |
    | 0      0      MIG 7g.80gb         4       0            0:7 |
    +-----------------------------------------------------------------+
    Code block. Example of checking MIG Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -dci
    Successfully destroyed compute instance ID  0 from GPU instance ID  0
    Code block. Compute Instance deletion example
    $ nvidia-smi mig -i 0 -gi 0 -lci
    No compute instances found: Not found
    Code block. Confirm Compute Instance deletion
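The confirmation step can also be checked from a script by looking for the "No ... instances found" message. The following is a minimal sketch that inspects the listing text rather than the exit status of `nvidia-smi`, since the exit status is not shown in this guide; the function name `no_instances_left` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when the listing output on stdin reports that no instances remain.
no_instances_left() {
  grep -Eq '^[[:space:]]*No (compute|GPU) instances found'
}

# Sample message copied from the confirmation example above.
if printf '%s\n' 'No compute instances found: Not found' | no_instances_left; then
  echo "deleted"
else
  echo "instances still present"
fi
```

The same check applies to the GPU Instance listing, for example `nvidia-smi mig -i 0 -lgi | no_instances_left`.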

Delete GPU Instance

  • Delete the GPU Instance.
    $ nvidia-smi mig -i [GPU ID] -dgi
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi
    Code block. nvidia-smi command - Delete GPU Instance
    $ nvidia-smi mig -i 0 -lgi
    +--------------------------------------------------------+
    | GPU instances: |
    | GPU   Name               Profile  Instance  Placement |
    | ID       ID      Start:Size |
    | ======================================================== |
    | 0  MIG 7g.80gb            0        0         0:8 |
    +--------------------------------------------------------+
    Code block. nvidia-smi command - Example of checking GPU Instance
    $ nvidia-smi mig -i 0 -dgi
    Successfully destroyed GPU instance ID  0 from GPU  0
    Code block. nvidia-smi command - GPU Instance deletion example
    $ nvidia-smi mig -i 0 -lgi
    No GPU instances found: Not found
    Code block. nvidia-smi command - Confirm GPU Instance deletion

Disable MIG feature (deactivation)

  • Disable MIG, then reboot the server.
    $ nvidia-smi -mig 0
    Disabled MIG Mode for GPU 00000000:05:00.0

    All done.
    Code block. nvidia-smi command - disable MIG
    $ nvidia-smi
    Mon Sep 30 05:18:28 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
    | -------------------------------+----------------------+---------------------- |
    | GPU  Name        Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap | Memory-Usage | GPU-Util  Compute M. |
    |  |  | MIG M. |
    | ===============================+======================+====================== |
    | 0  NVIDIA A100-SXM...  Off | 00000000:05:00.0 Off | 0 |
    | N/A   33C   P0    60W / 400W | 0MiB / 81251MiB | 0%      Default |
    |  |  | Disabled |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | MIG devices: |
    +-----------------------------------------------------------------------------+
    | GPU  GI  CI  MIG | Memory-Usage | Vol | Shared |
    | ID  ID  Dev | BAR1-Usage | SM     Unc | CE  ENC  DEC  OFA  JPG |
    |  |  | ECC |  |
    | ============================================================================= |
    | No MIG devices found |
    +-----------------------------------------------------------------------------+
    +-----------------------------------------------------------------------------+
    | Processes: |
    | GPU   GI   CI       PID   Type   Process name                   GPU Memory |
    | ID   ID                                                   Usage |
    | ============================================================================= |
    | No running processes found |
    +-----------------------------------------------------------------------------+
    Code block. nvidia-smi command - check GPU status
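Because the MIG mode change takes effect only after the reboot, it can help to compare the current and pending MIG modes. The sketch below assumes input lines in the form `<current>, <pending>`, which is how `nvidia-smi --query-gpu=mig.mode.current,mig.mode.pending --format=csv,noheader` prints them; the function name `mig_change_pending` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when any GPU line on stdin has a pending MIG mode that differs
# from its current MIG mode. Input lines look like: "Enabled, Disabled".
mig_change_pending() {
  awk -F', *' '$1 != $2 { found = 1 } END { exit !found }'
}

# Sample line: MIG is still enabled, and the disable request is waiting for a reboot.
if printf '%s\n' 'Enabled, Disabled' | mig_change_pending; then
  echo "MIG mode change not applied yet: reboot required"
else
  echo "MIG mode is up to date"
fi
```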

4 - Use NVSwitch on GPU Server

After creating a GPU Server, you can enable the NVSwitch feature on the GPU Server’s VM (Guest OS) and use fast GPU-to-GPU P2P communication.

Caution
Only the GPU Server (8 GPU) and Multi-node GPU Cluster of Samsung Cloud Platform are connected with NVSwitch and NVLink.

Exploring NVIDIA NVSwitch for Multi GPU

NVLink is a direct, bidirectional GPU-to-GPU interconnect that expands I/O among multiple GPUs within a server. Using NVSwitch, you can connect all GPUs in a server with full NVLink bandwidth.

Checking NVSwitch operation

Check the NVIDIA Fabric Manager, NVIDIA NVLink topology, and NVIDIA NVLink Status on the GPU server.

Reference
The example for checking NVSwitch operation is explained using the A100 GPU Server (g1v128a8) as a reference.

NVIDIA Fabric Manager operating status

Verify that active (running) is displayed when operating normally.

~$ systemctl status nvidia-fabricmanager
nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2026-02-02 16:23:27 KST; 32min ago
   Main PID: 2191 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 18.0M
        CPU: 33.461s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2191 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Code block. Check NVIDIA Fabric Manager status
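For automated checks, the status text can be tested for the `active (running)` state. This is a minimal sketch that parses `systemctl status` output in the layout shown above; the function name `is_active_running` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when `systemctl status` output on stdin shows the unit as active (running).
is_active_running() {
  grep -Eq '^[[:space:]]*Active: active \(running\)'
}

# Sample lines copied from the example output above.
sample_status='     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2026-02-02 16:23:27 KST; 32min ago'

if printf '%s\n' "$sample_status" | is_active_running; then
  echo "Fabric Manager is running"
else
  echo "Fabric Manager is NOT running"
fi
```

On a GPU Server, `systemctl is-active nvidia-fabricmanager` gives a similar one-word answer without any parsing.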

Check NVIDIA NVLink topology

Check the NVIDIA NVLink topology.

~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Code block. Check NVIDIA NVLink topology
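In the matrix above, every GPU pair shows `NV12`, meaning a bonded set of 12 NVLinks. The following sketch checks that property automatically; it assumes the header row comes first and that the GPU columns appear in order starting at GPU0, as in the example. The function name `check_all_nvlink` is hypothetical.

```shell
#!/usr/bin/env bash
# Read an `nvidia-smi topo -m` matrix on stdin and report GPU pairs not linked by NVLink.
# Exit status is 0 only when every GPU-to-GPU cell is "X" (self) or "NV<n>".
check_all_nvlink() {
  awk '
    # Header row: count the GPU columns (GPU0, GPU1, ...).
    NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^GPU[0-9]+$/) ngpu++; next }
    # Matrix rows: inspect only the GPU-to-GPU cells.
    $1 ~ /^GPU[0-9]+$/ {
      for (i = 2; i <= ngpu + 1; i++)
        if ($i != "X" && $i !~ /^NV[0-9]+$/) {
          print $1 " -> GPU" (i - 2) ": " $i
          bad = 1
        }
    }
    END { exit bad + 0 }
  '
}

# Shortened two-GPU sample in the same layout as the example output above.
sample_matrix='        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    0-127   0-7             N/A
GPU1    NV12     X      0-127   0-7             N/A'

if printf '%s\n' "$sample_matrix" | check_all_nvlink; then
  echo "all GPU pairs are connected through NVLink"
fi
```

Any pair that falls back to a PCIe path (for example `SYS` or `PHB`) is printed and makes the function return a non-zero status.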

Check NVIDIA NVLink Status

Check the NVIDIA NVLink Status.

~$ nvidia-smi nvlink -s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
Code block. Check NVIDIA NVLink status
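The per-link listing is long, so a short summary per GPU can be easier to read. The sketch below counts the links that report a speed and sums the reported values; it assumes the `Link N: <speed> GB/s` layout shown above, and the function name `summarise_nvlink` is hypothetical.

```shell
#!/usr/bin/env bash
# Summarise NVLink status output (stdin): number of reporting links and summed GB/s per GPU.
summarise_nvlink() {
  awk '
    /^GPU [0-9]+:/ { gpu = $2; sub(":", "", gpu); order[++n] = gpu }
    /Link [0-9]+: [0-9.]+ GB\/s/ { links[gpu]++; total[gpu] += $(NF - 1) }
    END {
      for (i = 1; i <= n; i++)
        printf "GPU %s: %d links, %d GB/s\n", order[i], links[order[i]], total[order[i]]
    }
  '
}

# Shortened sample (3 of the 12 links) in the same layout as the example output above.
sample_status='GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s'

printf '%s\n' "$sample_status" | summarise_nvlink
```

With the full output above, each GPU has 12 links at 25 GB/s, so the summed value is 12 × 25 = 300 GB/s per GPU. Links that do not report a speed are not counted.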

5 - Install ServiceWatch Agent

Users can install the ServiceWatch Agent on a GPU server to collect custom metrics and logs.

Reference
Collecting custom metrics/logs via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will also be available in other offerings in the future.
Caution
Metric collection through the ServiceWatch Agent is classified as custom metrics and, unlike the default metrics collected from each service, incurs charges; therefore, we recommend removing or disabling unnecessary metric collection settings.

ServiceWatch Agent

Two main types of agents must be installed on a GPU server to collect custom metrics and logs for ServiceWatch: a Prometheus Exporter and the Open Telemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape
  • To collect OS metrics from servers, you can use Node Exporter for Linux servers and Windows Exporter for Windows servers, depending on the OS type.
Open Telemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Explanation of Prometheus Exporter and Open Telemetry Collector
Caution

If you have configured a Kubernetes Engine on a GPU server, please check GPU metrics using the metrics provided by the Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not operate correctly.

Pre-configuration for Using ServiceWatch Agent

To use the ServiceWatch Agent, complete the prerequisite settings described in Prerequisite Settings for ServiceWatch Agent.

Install Prometheus Exporter for GPU metrics (for Ubuntu)

Install the Prometheus Exporter for collecting metrics from the GPU server according to the steps below.

Verify NVIDIA Driver Installation

  • Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example

NVSwitch Configuration and Query (NSCQ) Library Installation

Reference
The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.
Information
The installation commands below can be used in an environment with internet access. If you are in an environment without internet access, you must download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    Code block. cuda-keyring download command
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    Code block. cuda-keyring installation command
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Example of checking NVIDIA Driver version

  2. Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535
    Code block. NSCQ library apt-cache command
    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install the version that matches the driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
    
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NSCQ library apt-cache command result
    sudo apt install libnvidia-nscq-535=535.183.06-1
    Code block. NSCQ library installation command
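Since the package version has to match the driver version, the package spec can be derived from the driver version string instead of being typed by hand. This is a sketch only: the function name `nscq_package_spec` is hypothetical, and the `-1` package revision is assumed from the example above, so confirm it against the `apt-cache policy` output before installing.

```shell
#!/usr/bin/env bash
# Build the apt package spec for libnvidia-nscq from an NVIDIA driver version string.
nscq_package_spec() {
  local driver_version="$1"
  local branch="${driver_version%%.*}"    # driver branch, e.g. 535 from 535.183.06
  # The "-1" package revision follows the example above; verify it with apt-cache policy.
  echo "libnvidia-nscq-${branch}=${driver_version}-1"
}

nscq_package_spec 535.183.06
```

On a GPU Server, the driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1` and passed to the function.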

Information

The libnvidia-nscq version must match the NVIDIA Driver version.

  • Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1

NVSwitch Device Monitoring API (NVSDM) Library Installation

Reference
The NVSDM Library is required for Blackwell or later GPU architectures. NVIDIA Driver versions below 560 do not provide the NVSDM Library.
  • Install the NVSDM library.
    apt-cache policy libnvsdm
    Code block. NVSDM library apt-cache command
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library apt-cache command result
    apt install libnvsdm=580.105.08-1
    Code block. Install NVSDM library
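
Because driver versions below 560 do not provide the NVSDM library, it can help to check the driver branch before attempting the installation. The following is a minimal sketch. driver_supports_nvsdm is a hypothetical helper name, and the check only compares the first component of the driver version.

```shell
# Hypothetical helper: succeed (exit 0) when the driver branch is 560 or newer,
# the first branch that provides libnvsdm according to the note above.
driver_supports_nvsdm() {
  branch=${1%%.*}                   # e.g. 580 from 580.105.08
  [ "$branch" -ge 560 ]
}

if driver_supports_nvsdm 580.105.08; then
  echo "libnvsdm is available for this driver branch"
fi
```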

Install NVIDIA DCGM (for Ubuntu)

Install the DCGM Exporter according to the steps below.

  1. Install DCGM (datacenter-gpu-manager)
  2. Install datacenter-gpu-manager-exporter
  3. Enable and start the DCGM service

DCGM (datacenter-gpu-manager) Installation

The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. The cuda12 suffix indicates that the package targets CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  2. Install datacenter-gpu-manager-cuda.
    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation command
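
The CUDA_VERSION value in the steps above is the major part of the version reported by nvidia-smi. The following is a minimal sketch of deriving it automatically instead of typing it. cuda_major_from_smi is a hypothetical helper name, and the sample line is the one from the example output above, so the snippet runs without a GPU.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the CUDA
# major version found in the "CUDA Version: X.Y" field.
cuda_major_from_smi() {
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Example with the sample line shown above (no GPU required).
sample='| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |'
CUDA_VERSION=$(printf '%s\n' "$sample" | cuda_major_from_smi)
echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```

On a GPU server, the same helper can be fed directly: CUDA_VERSION=$(nvidia-smi | cuda_major_from_smi).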

Install datacenter-gpu-manager-exporter

datacenter-gpu-manager-exporter (DCGM Exporter) is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.

  1. Install datacenter-gpu-manager-exporter.
    apt install datacenter-gpu-manager-exporter
    Code block. datacenter-gpu-manager-exporter installation command
  2. Check the DCGM Exporter configuration file.
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. Command to check the datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Example of verification result for the datacenter-gpu-manager-exporter configuration file
  3. Review the configuration provided with the DCGM Exporter installation: remove the leading # from the metrics you need, and add # to the metrics you do not need.
    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. Example of datacenter-gpu-manager-exporter metric configuration
Reference
See DCGM Exporter Metrics for the metrics that can be collected with the GPU DCGM Exporter and how to configure them.
Caution
Metrics collected through the ServiceWatch Agent are classified as custom metrics and, unlike the default collected metrics, incur charges. Remove or disable any metrics you do not need to avoid excessive charges.
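
Editing default-counters.csv by hand works, but the same comment toggling can be scripted when many servers share a configuration. The following is a minimal sketch. enable_metric and disable_metric are hypothetical helper names, and the sketch assumes each entry starts with the DCGM field name followed by a comma, as in the example above.

```shell
# Hypothetical helpers for toggling entries in a DCGM Exporter counters file.
# Each call edits the file in place and leaves a .bak backup next to it.

# Comment out a metric so that it is no longer collected.
disable_metric() {   # usage: disable_metric FILE FIELD
  sed -i.bak -E "s/^[[:space:]]*($2[[:space:]]*,)/# \1/" "$1"
}

# Remove the leading '#' so that the metric is collected.
enable_metric() {    # usage: enable_metric FILE FIELD
  sed -i.bak -E "s/^[[:space:]]*#[[:space:]]*($2[[:space:]]*,)/\1/" "$1"
}
```

For example, enable_metric /etc/dcgm-exporter/default-counters.csv DCGM_FI_PROF_PIPE_FP64_ACTIVE uncomments that entry. The exporter most likely needs a restart (systemctl restart nvidia-dcgm-exporter) to reread the file. That is an assumption here, not something stated in this guide.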

Enable and start DCGM service

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command
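
After both services are started, a quick check that the exporter is actually serving metrics can save debugging time later. The following is a minimal sketch. The function names are hypothetical, and the URL assumes the exporter listens on its default address, localhost:9400, which this guide does not state explicitly.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin and count
# the sample lines whose metric name starts with DCGM_ (comment lines that
# start with '#' are not counted).
count_dcgm_samples() {
  grep -c '^DCGM_' || true
}

# Hypothetical helper: fetch the metrics endpoint and print the sample count.
# Assumption: the exporter listens on localhost:9400 unless a URL is given.
check_dcgm_exporter() {
  url=${1:-http://localhost:9400/metrics}
  curl -s --max-time 5 "$url" | count_dcgm_samples
}
```

Running check_dcgm_exporter on the GPU server prints the number of DCGM samples currently exposed. A result of 0 suggests that the exporter is not running or that every entry in the counters file is commented out.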

Information
After completing the DCGM Exporter setup, you must install the OpenTelemetry Collector provided by ServiceWatch to finish configuring the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.

Install Prometheus Exporter for GPU metrics (for RHEL)

Install the ServiceWatch Agent to collect metrics from the GPU server according to the steps below.

NVIDIA Driver installation verification (for RHEL)

  1. Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Example of checking NVIDIA Driver version
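
The query above prints one line per GPU, so on a multi-GPU server the same version is repeated. The following is a minimal sketch that collapses the CSV output to the distinct driver versions, which should be a single value on a healthy server. unique_driver_versions is a hypothetical helper name.

```shell
# Hypothetical helper: read the CSV output of the driver version query on
# stdin, skip the header line and blank lines, and print each distinct version.
unique_driver_versions() {
  tail -n +2 | grep -v '^[[:space:]]*$' | sort -u
}

# Example with sample output for a two-GPU server (no GPU required).
printf 'driver_version\n535.183.06\n535.183.06\n' | unique_driver_versions
```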

NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)

Reference

The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.

  • For RHEL, verify that libnvidia-nscq is installed and install it if necessary.
Information
The installation commands below require internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Check the libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq
    # Example output when the package is already installed:
    # libnvidia-nscq-535-535.183.06-1.x86_64
    Code block. Check NSCQ library package

  2. Add the CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  3. Reset the state of the NVIDIA Driver DNF module.

    dnf module reset nvidia-driver
    Code block. Initialize the state of the NVIDIA Driver DNF module
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Disabling module profiles:
    nvidia-driver/default
    nvidia-driver/fm
    Resetting modules:
    nvidia-driver
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example of the status initialization result of the NVIDIA Driver DNF module.

  4. Enable the NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open
    Code block. NVIDIA Driver module activation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Enabling module streams:
    nvidia-driver 535-open
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example of NVIDIA Driver module activation result

  5. Check the libnvidia-nscq module list.

    dnf list libnvidia-nscq-535 --showduplicates
    Code block. Check libnvidia-nscq module list

  6. Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
    Code block. libnvidia-nscq installation command
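
The module stream and the package name used in the steps above both follow from the driver version. The following is a minimal sketch that derives them, assuming the naming pattern of the examples in this section (an -open module stream and a -1 package revision). Both function names are hypothetical.

```shell
# Hypothetical helper: print the DNF module stream for a driver version.
# Assumption: the "-open" stream is used, as in the step above.
driver_module_stream() {
  printf 'nvidia-driver:%s-open\n' "${1%%.*}"
}

# Hypothetical helper: print the NSCQ package spec for a driver version.
# Assumption: the package revision is "-1", as in the step above.
nscq_rpm_for_driver() {
  printf 'libnvidia-nscq-%s-%s-1\n' "${1%%.*}" "$1"
}

driver_module_stream 535.183.06
nscq_rpm_for_driver 535.183.06
```

The printed values correspond to the arguments of dnf module enable and dnf install in steps 4 and 6.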

NVSwitch Device Monitoring API (NVSDM) Library Installation (for RHEL)

Reference
For Blackwell and later GPU architectures, the NVSDM library is required. NVIDIA Driver versions below 560 do not provide the NVSDM library.
  1. Check the NVSDM library module list.

    dnf list libnvsdm --showduplicates
    Code block. Check NVSDM library module list
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. Example of NVSDM library module list verification results.

  2. Install libnvsdm.

    dnf install libnvsdm-580.105.08-1
    Code block. Install NVSDM library
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    =========================================================================
    Package Architecture Version Repository Size
    =========================================================================
    Installing:
    libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
    Installing dependencies:
    infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
    libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
    
    Transaction Summary
    =========================================================================
    Install 3 Packages
    
    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
    Code block. Example of the NVSDM library installation command result

Install NVIDIA DCGM (for RHEL)

Install the DCGM Exporter according to the steps below.

  1. Install DCGM (datacenter-gpu-manager)
  2. Install datacenter-gpu-manager-exporter
  3. Enable and start the DCGM service

DCGM (datacenter-gpu-manager) Installation (for RHEL)

The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. The cuda12 suffix indicates that the package targets CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Add the CUDA repository to DNF.
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository
  2. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  3. Check the list of datacenter-gpu-manager-cuda modules.
    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
    Code block. Check the datacenter-gpu-manager-cuda module list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
    Code block. Example of checking the list of datacenter-gpu-manager-cuda modules.
  4. Install datacenter-gpu-manager-cuda.
    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. Install datacenter-gpu-manager-cuda
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M
    
    Transaction Summary
    ====================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
    Code block. Example of datacenter-gpu-manager-cuda installation result
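
The package list in step 3 can be long, and picking out the newest entry by eye is error-prone. The following is a minimal sketch that extracts the highest listed version from dnf list output. latest_listed_version is a hypothetical helper name, and it relies on sort -V (a GNU coreutils extension), which orders version strings textually and does not implement full RPM epoch comparison.

```shell
# Hypothetical helper: read "dnf list PKG --showduplicates" output on stdin
# and print the highest version listed for PKG.
latest_listed_version() {   # usage: ... | latest_listed_version PKG
  awk -v pkg="$1" 'index($1, pkg ".") == 1 { print $2 }' | sort -V | tail -n 1
}

# Example with a few lines from the sample output above.
printf '%s\n' \
  'Available Packages' \
  'datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64' \
  'datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64' \
  'datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64' \
  | latest_listed_version datacenter-gpu-manager-4-cuda12
```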

datacenter-gpu-manager-exporter installation (for RHEL)

datacenter-gpu-manager-exporter (DCGM Exporter) is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.

  1. Add the CUDA Repository to DNF. (If you have already run this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  2. Check the CUDA version. (If you have already run this command, proceed to the next step.)

    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command

  3. Check the datacenter-gpu-manager-exporter module list.

    dnf list datacenter-gpu-manager-exporter --showduplicates
    Code block. Check the list of datacenter-gpu-manager-exporter modules
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64
    Code block. Example of checking the module list of datacenter-gpu-manager-exporter

  4. Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.X requires glibc 2.34 or newer, which is provided starting with RHEL 9, so on RHEL 8 install version 4.1.3-1 explicitly.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1
    Code block. Install datacenter-gpu-manager-exporter
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter               x86_64         4.1.3-1     cuda-rhel8-x86_64   26 M
    
    
    Is this ok [y/N]: y
    Code block. Example of datacenter-gpu-manager-exporter installation result
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. Command to check the datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Example of the result of checking the datacenter-gpu-manager-exporter configuration file.

  5. Review the configuration provided with the DCGM Exporter installation: remove the leading # from the metrics you need, and add # to the metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. Example of datacenter-gpu-manager-exporter metric configuration

Reference
Refer to DCGM Exporter Metrics for the metrics that can be collected with the GPU DCGM Exporter and how to configure them.
Caution
Metrics collected through the ServiceWatch Agent are classified as custom metrics and, unlike the default collected metrics, incur charges. Remove or disable any metrics you do not need to avoid excessive charges.
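
Step 4 above pins the exporter version because dcgm-exporter 4.5.X requires glibc 2.34 or newer. The following is a minimal sketch of checking that requirement before choosing a version. glibc_at_least is a hypothetical helper name, and it compares only the major and minor components of the version.

```shell
# Hypothetical helper: succeed (exit 0) when the installed glibc version is
# at least the required version. Only MAJOR.MINOR is compared.
glibc_at_least() {   # usage: glibc_at_least INSTALLED REQUIRED
  have_major=${1%%.*}; have_rest=${1#*.}; have_minor=${have_rest%%.*}
  need_major=${2%%.*}; need_rest=${2#*.}; need_minor=${need_rest%%.*}
  if [ "$have_major" -ne "$need_major" ]; then
    [ "$have_major" -gt "$need_major" ]
  else
    [ "$have_minor" -ge "$need_minor" ]
  fi
}

# Example: RHEL 8 ships glibc 2.28, which is older than 2.34.
if glibc_at_least 2.28 2.34; then
  echo "glibc is new enough for dcgm-exporter 4.5.X"
else
  echo "install datacenter-gpu-manager-exporter-4.1.3-1"
fi
```

On the server itself, the installed version can be read with getconf GNU_LIBC_VERSION, which prints a line such as "glibc 2.28", and the second field can be passed as the first argument.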

Enable and start DCGM service (for RHEL)

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command

Information
After completing the DCGM Exporter configuration, you must install the OpenTelemetry Collector provided by ServiceWatch to finish setting up the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.

DCGM Exporter metrics

DCGM Exporter Key Metrics

Among the metrics provided by the DCGM Exporter, the main GPU metrics are as follows.

Category | DCGM Field | Prometheus Metric Type | Summary
Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz)
Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz)
Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C)
Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W)
Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %)
Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %)
Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB)
Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB)
NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes
Table. Major GPU metrics provided by DCGM Exporter

DCGM Exporter metric collection configuration

For the default metrics of DCGM Exporter, refer to DCGM Exporter > Default Metrics.

  • To collect metrics beyond the defaults, remove the leading # from the corresponding entries in default-counters.csv.
  • For default metrics you do not want to collect, add a leading # or delete the corresponding entry.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...
Code block. default-counters.csv configuration example
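
Because every uncommented entry is collected as a custom metric, it is worth reviewing which fields are currently enabled before starting the exporter. The following is a minimal sketch that prints only the active DCGM field names from a counters file in the format shown above. list_enabled_fields is a hypothetical helper name.

```shell
# Hypothetical helper: print the DCGM field name of every entry in a counters
# file that is neither a comment (leading '#') nor a blank line.
list_enabled_fields() {   # usage: list_enabled_fields FILE
  awk -F, '!/^[ \t]*(#|$)/ { gsub(/[ \t]/, "", $1); print $1 }' "$1"
}
```

For example, list_enabled_fields /etc/dcgm-exporter/default-counters.csv | wc -l gives the number of fields that will be exported.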