1 - Overview

Service Overview

GPU Server is a virtualized computing service that lets you allocate infrastructure resources such as CPU, GPU, and memory in the amounts you need, when you need them, without purchasing them individually. It is suited to workloads that require fast computation in a cloud environment, such as AI model experimentation, prediction, and inference, and you can flexibly select resources whose performance matches the type and scale of the work. GPU Server provides the following features.

Provided Features

  • GPU Server Management: Through the web-based console, users can handle GPU Server provisioning, monitoring, and billing themselves (self-service), including creating, modifying, and deleting servers.
  • Product offerings by GPU quantity: Depending on the project’s purpose and scale, you can freely select the number of H100/A100 GPUs to configure a virtual server.
  • High-Performance GPU Provision: GPUs are attached using a pass-through method, providing performance at the level of a physical server.
  • Storage Connection: Provides additional attached storage besides the OS disk.
    • You can connect and use Block Storage, File Storage, and Object Storage.
  • Strong Security: Use the Security Group service to control inbound/outbound traffic exchanged with the external internet or other VPCs (Virtual Private Cloud), securely protecting the server.
  • Monitoring: You can view monitoring information such as the status of computing resources (CPU, Memory, Disk, and GPU) through the Cloud Monitoring service.
  • Network Configuration Management: The server’s subnet/IP can be easily changed from the values set at initial creation.
    • NAT IP can be enabled or disabled as needed.
  • Key Pair method: For secure OS access, a Key Pair method is provided instead of ID/PW login.
  • Image Management: You can create and manage Custom Images and share them between projects.
  • ServiceWatch Integration: You can monitor data through the ServiceWatch service.

Component

GPU Server provides GPUs, NVSwitch, and NVLink on top of virtualized computing resources.

Caution
  • NVSwitch can only be enabled and used for instance types that allocate eight GPUs on a single GPU server.

Specifications by GPU Type

A GPU (Graphics Processing Unit) performs the calculations needed to render the images shown on a computer screen. Because it is specialized for parallel processing, it can also handle large amounts of data quickly in large-scale parallel workloads such as artificial intelligence (AI) and data analysis.

The following are the specifications of the GPU Types offered by the GPU Server service.

Category | A100 Type | H100 Type | B300 Type
GPU Architecture | NVIDIA Ampere | NVIDIA Hopper | NVIDIA Blackwell Ultra
GPU Memory | 80 GiB | 80 GiB | 268 GiB
GPU Transistors | 54 billion 7N TSMC | 80 billion 4N TSMC | 208 billion 4NP TSMC
FP16 Tensor Core (Dense) | 312 TFLOPs | 989 TFLOPs | 2.25 PFLOPs
FP8 Tensor Core (Dense) | Not supported | 1,979 TFLOPs | 4.5 PFLOPs
FP4 Tensor Core (Dense) | Not supported | Not supported | 13.5 PFLOPs
GPU Memory Bandwidth | 2,039 GB/s HBM2e | 3,352 GB/s HBM3 | 8 TB/s HBM3e
NVLink performance | NVLink 3 | NVLink 4 | NVLink 5
NVLink Signaling Rate | 25 GB/s (x12) | 25 GB/s (x18) | 50 GB/s (x18)
NVSwitch GPU-to-GPU bandwidth | 600 GB/s | 900 GB/s | 1.8 TB/s
Total NVSwitch aggregate bandwidth | 4.8 TB/s | 7.2 TB/s | 14.4 TB/s
Table. Specifications by GPU Type
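
The NVLink and NVSwitch rows in the table above are related by simple arithmetic. The sketch below reproduces them under three assumptions that are not stated in this guide: the signaling rate is per link and per direction, the GPU-to-GPU figure counts both directions, and the aggregate figure covers eight GPUs. The function names are illustrative only.

```python
# Values taken from the "Specifications by GPU Type" table.
NVLINK = {
    "A100": {"rate_gb_s": 25, "links": 12},
    "H100": {"rate_gb_s": 25, "links": 18},
    "B300": {"rate_gb_s": 50, "links": 18},
}

GPUS_PER_SERVER = 8  # NVSwitch applies to instance types with eight GPUs


def gpu_to_gpu_gb_s(rate_gb_s: int, links: int) -> int:
    """Per-GPU bandwidth, assuming the rate is per link and per direction."""
    return rate_gb_s * links * 2


def aggregate_tb_s(per_gpu_gb_s: int, gpus: int = GPUS_PER_SERVER) -> float:
    """Total bandwidth across all GPUs, converted from GB/s to TB/s."""
    return per_gpu_gb_s * gpus / 1000


for name, spec in NVLINK.items():
    per_gpu = gpu_to_gpu_gb_s(**spec)
    print(name, per_gpu, "GB/s per GPU,", aggregate_tb_s(per_gpu), "TB/s aggregate")
```

Under these assumptions the computed values match the table (for example, 25 GB/s x 12 links x 2 directions = 600 GB/s for the A100 type).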

Server type

The server types offered by GPU Server are as follows. For detailed information about each server type, refer to GPU Server Server type.

Category | Server type | CPU vCore | Memory (GB) | Number of GPUs
GPU-A100-1 | g1v16a1 | 16 | 234 | 1
GPU-A100-1 | g1v32a2 | 32 | 468 | 2
GPU-A100-1 | g1v64a4 | 64 | 936 | 4
GPU-A100-1 | g1v128a8 | 128 | 1,872 | 8
GPU-H100-2 | g2v12h1 | 12 | 234 | 1
GPU-H100-2 | g2v24h2 | 24 | 468 | 2
GPU-H100-2 | g2v48h4 | 48 | 936 | 4
GPU-H100-2 | g2v96h8 | 96 | 1,872 | 8
GPU-B300-3 | g3v16b1 | 16 | 480 | 1
GPU-B300-3 | g3v32b2 | 32 | 960 | 2
GPU-B300-3 | g3v64b4 | 64 | 1,920 | 4
GPU-B300-3 | g3v128b8 | 128 | 3,840 | 8
Table. GPU Server server type

OS and GPU driver version

The operating systems (OS) supported by GPU Server are as follows. Note that B300-type GPUs are supported only from a specific GPU driver version onward, so take care when selecting an image.

OS | OS version | GPU driver version | Server type classification
Ubuntu | 24.04 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1
Ubuntu | 24.04 | 570.195.03 | GPU-H100-2, GPU-A100-1
Ubuntu | 22.04 | 535.183.06 | GPU-H100-2, GPU-A100-1
RHEL | 9.6 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1
RHEL | 8.10 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1
RHEL | 8.10 | 535.183.06 | GPU-H100-2, GPU-A100-1
Table. GPU Server OS and GPU driver version
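
When choosing an image, it can help to filter the table above by server type classification. The following is a minimal sketch that encodes the table rows as data. The `Image` type and `images_for` helper are hypothetical names for illustration and are not part of any Samsung Cloud Platform API.

```python
from typing import NamedTuple


class Image(NamedTuple):
    os: str
    os_version: str
    driver: str
    families: tuple


# Rows copied from the "GPU Server OS and GPU driver version" table.
ALL = ("GPU-B300-3", "GPU-H100-2", "GPU-A100-1")
NO_B300 = ("GPU-H100-2", "GPU-A100-1")
IMAGES = [
    Image("Ubuntu", "24.04", "580.126.20", ALL),
    Image("Ubuntu", "24.04", "570.195.03", NO_B300),
    Image("Ubuntu", "22.04", "535.183.06", NO_B300),
    Image("RHEL", "9.6", "580.126.20", ALL),
    Image("RHEL", "8.10", "580.126.20", ALL),
    Image("RHEL", "8.10", "535.183.06", NO_B300),
]


def images_for(family: str) -> list:
    """Return the table rows whose classification includes `family`."""
    return [img for img in IMAGES if family in img.families]


for img in images_for("GPU-B300-3"):
    print(img.os, img.os_version, img.driver)
```

In this table, every row that lists GPU-B300-3 uses driver version 580.126.20.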

Preceding Service

The following services must be configured before you create this service. Prepare them in advance by referring to the user guide for each service.

Service Category | Service | Detailed description
Networking | VPC | A service that provides an isolated virtual network in a cloud environment
Networking | Security Group | Virtual firewall that controls server traffic
Table. GPU Server Preceding Services

1.1 - Server type

GPU Server server type

GPU servers are categorized by the GPU type they provide, and the GPU used in a GPU server is determined by the server type selected when creating the GPU server. Please select the server type based on the specifications of the application you want to run on the GPU server.

The server types supported by the GPU Server are as follows.

GPU-H100-2 g2v12h1
Category | Example | Detailed description
Server type | GPU-H100-2 | Provided server type classification
  • GPU-H100-2
    • GPU-H100 indicates the provided GPU type
    • 2 indicates the generation
  • GPU-A100-1
    • GPU-A100 indicates the provided GPU type
    • 1 indicates the generation
Server specifications | g2 | Provided server type classification and generation
  • g2
    • g means GPU server specification
    • 2 means generation
Server specifications | v12 | Number of vCores
  • v12: 12 virtual cores
Server specifications | h1 | GPU type and quantity
  • h1
    • h means GPU-H100
    • 1 means one GPU
  • a2
    • a means GPU-A100
    • 2 means two GPUs
Table. GPU Server server type format
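
The naming format above can be decoded mechanically. The sketch below parses the specification part of a server type (for example `g2v12h1`) with a regular expression. The mapping of the letter `b` to GPU-B300 is inferred from the server type tables rather than stated in the format table, and `ServerSpec` and `parse_server_type` are hypothetical names.

```python
import re
from dataclasses import dataclass

# GPU letters seen in the server type tables: a = A100, h = H100, b = B300 (inferred).
GPU_LETTERS = {"a": "GPU-A100", "h": "GPU-H100", "b": "GPU-B300"}
PATTERN = re.compile(r"^g(\d+)v(\d+)([a-z])(\d+)$")


@dataclass(frozen=True)
class ServerSpec:
    generation: int
    vcores: int
    gpu_type: str
    gpu_count: int


def parse_server_type(name: str) -> ServerSpec:
    """Split a server type such as 'g2v12h1' into its fields."""
    match = PATTERN.match(name)
    if match is None or match.group(3) not in GPU_LETTERS:
        raise ValueError(f"unrecognized server type: {name!r}")
    generation, vcores, letter, count = match.groups()
    return ServerSpec(int(generation), int(vcores), GPU_LETTERS[letter], int(count))


print(parse_server_type("g2v12h1"))
```

For `g2v12h1` this yields generation 2, 12 vCores, GPU-H100, and one GPU, matching the example in the table.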

g1 server type

The g1 server type is a GPU Server that uses the NVIDIA A100 Tensor Core GPU, suitable for high-performance applications.

  • Up to 8 NVIDIA A100 Tensor Core GPUs provided
  • Equipped with 6,912 CUDA cores and 432 Tensor cores per GPU
  • Supports up to 128 vCPUs and 1,872 GB of memory
  • Networking speed up to 40 Gbps
  • 600 GB/s GPU-to-GPU P2P communication through NVIDIA NVSwitch
Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth
GPU-A100-1 | g1v16a1 | 1 | 16 vCore | 234 GB | 80 GiB | Up to 20 Gbps
GPU-A100-1 | g1v32a2 | 2 | 32 vCore | 468 GB | 160 GiB | Up to 20 Gbps
GPU-A100-1 | g1v64a4 | 4 | 64 vCore | 936 GB | 320 GiB | Up to 40 Gbps
GPU-A100-1 | g1v128a8 | 8 | 128 vCore | 1,872 GB | 640 GiB | Up to 40 Gbps
Table. GPU Server server type > GPU-A100-1 server type

g2 server type

The g2 server type is a GPU Server that uses the NVIDIA H100 Tensor Core GPU, making it suitable for high-performance applications.

  • Up to 8 NVIDIA H100 Tensor Core GPUs provided
  • Equipped with 16,896 CUDA cores and 528 Tensor cores per GPU
  • Supports up to 96 vCPUs and 1,872 GB of memory
  • Networking speed up to 40 Gbps
  • 900 GB/s GPU-to-GPU P2P communication through NVIDIA NVSwitch
Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth
GPU-H100-2 | g2v12h1 | 1 | 12 vCore | 234 GB | 80 GiB | Up to 20 Gbps
GPU-H100-2 | g2v24h2 | 2 | 24 vCore | 468 GB | 160 GiB | Up to 20 Gbps
GPU-H100-2 | g2v48h4 | 4 | 48 vCore | 936 GB | 320 GiB | Up to 40 Gbps
GPU-H100-2 | g2v96h8 | 8 | 96 vCore | 1,872 GB | 640 GiB | Up to 40 Gbps
Table. GPU Server server type > GPU-H100-2 server type

g3 server type

The g3 server type is a GPU Server that uses the NVIDIA B300 Tensor Core GPU, making it suitable for high-performance applications.

  • Up to 8 NVIDIA B300 Tensor Core GPUs provided
  • Equipped with 20,480 CUDA cores and 640 Tensor cores per GPU
  • Supports up to 128 vCPUs and 3,840 GB of memory
  • Networking speed up to 40 Gbps
  • 1.8 TB/s GPU-to-GPU P2P communication through NVIDIA NVSwitch
Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth
GPU-B300-3 | g3v16b1 | 1 | 16 vCore | 480 GB | 268 GiB | Up to 20 Gbps
GPU-B300-3 | g3v32b2 | 2 | 32 vCore | 960 GB | 536 GiB | Up to 20 Gbps
GPU-B300-3 | g3v64b4 | 4 | 64 vCore | 1,920 GB | 1,072 GiB | Up to 40 Gbps
GPU-B300-3 | g3v128b8 | 8 | 128 vCore | 3,840 GB | 2,144 GiB | Up to 40 Gbps
Table. GPU Server server type > GPU-B300-3 server type
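
One way to use the three tables above is to shortlist server types that meet an application's minimum requirements. The sketch below encodes the table rows and returns the matching types with the fewest GPUs first. The ranking rule and the `candidates` helper are assumptions made for illustration. Price and availability are not considered because this guide does not list them.

```python
# (server type, GPUs, vCores, memory GB, total GPU memory GiB) from the tables above.
SERVER_TYPES = [
    ("g1v16a1", 1, 16, 234, 80),
    ("g1v32a2", 2, 32, 468, 160),
    ("g1v64a4", 4, 64, 936, 320),
    ("g1v128a8", 8, 128, 1872, 640),
    ("g2v12h1", 1, 12, 234, 80),
    ("g2v24h2", 2, 24, 468, 160),
    ("g2v48h4", 4, 48, 936, 320),
    ("g2v96h8", 8, 96, 1872, 640),
    ("g3v16b1", 1, 16, 480, 268),
    ("g3v32b2", 2, 32, 960, 536),
    ("g3v64b4", 4, 64, 1920, 1072),
    ("g3v128b8", 8, 128, 3840, 2144),
]


def candidates(min_gpu_mem_gib: int, min_vcores: int = 0) -> list:
    """Server types meeting both minimums, fewest GPUs first."""
    ok = [t for t in SERVER_TYPES if t[4] >= min_gpu_mem_gib and t[2] >= min_vcores]
    return sorted(ok, key=lambda t: (t[1], t[4]))


# Example: a workload assumed to need at least 300 GiB of total GPU memory.
for row in candidates(300):
    print(row)
```

With these inputs the first candidate is `g3v32b2`, because two B300 GPUs already provide 536 GiB.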

1.2 - Monitoring Metrics

Cloud Monitoring service termination notice

According to Samsung Cloud Platform’s policy, the Cloud Monitoring service is scheduled to be discontinued in September 2026.
Accordingly, after the September 2026 release, resource monitoring of the Samsung Cloud Platform via Cloud Monitoring will no longer be possible.

As the replacement service, ServiceWatch, released in October 2025, lets you continue monitoring your resources.
ServiceWatch provides more modern and powerful features, replacing Cloud Monitoring to deliver a seamless monitoring environment.

If you are collecting metrics and logs through the Cloud Monitoring Agent, you need to switch to the ServiceWatch Agent.

For detailed information about ServiceWatch, refer to ServiceWatch Overview.
For detailed information about the ServiceWatch Agent, refer to ServiceWatch Agent.

GPU Server Monitoring Metrics

The table below shows the monitoring metrics of the GPU server that can be viewed through Cloud Monitoring.

Basic monitoring metrics are provided even without installing the Agent; see the table GPU Server Basic Monitoring Metrics (Provided by Default). Metrics that become available after installing the Agent are listed in the table Additional monitoring metrics for GPU Server (Agent installation required).

For detailed usage of Cloud Monitoring, refer to the Cloud Monitoring guide.

Performance Item Name | Explanation | Unit
Memory Total [Basic] | Bytes of usable memory | bytes
Memory Used [Basic] | Bytes of currently used memory | bytes
Memory Swap In [Basic] | Bytes of memory swapped in | bytes
Memory Swap Out [Basic] | Bytes of memory swapped out | bytes
Memory Free [Basic] | Bytes of unused memory | bytes
Disk Read Bytes [Basic] | Bytes read | bytes
Disk Read Requests [Basic] | Number of read requests | cnt
Disk Write Bytes [Basic] | Bytes written | bytes
Disk Write Requests [Basic] | Number of write requests | cnt
CPU Usage [Basic] | Average system CPU usage over 1 minute | %
Instance State [Basic] | Instance status | state
Network In Bytes [Basic] | Bytes received | bytes
Network In Dropped [Basic] | Number of dropped incoming packets | cnt
Network In Packets [Basic] | Number of received packets | cnt
Network Out Bytes [Basic] | Bytes sent | bytes
Network Out Dropped [Basic] | Number of dropped outgoing packets | cnt
Network Out Packets [Basic] | Number of transmitted packets | cnt
Table. GPU Server Basic Monitoring Metrics (Provided by Default)
Performance Item Name | Explanation | Unit
GPU Count | Number of GPUs | cnt
GPU Memory Usage | GPU memory usage rate | %
GPU Memory Used | GPU memory used | MB
GPU Temperature | GPU temperature | -
GPU Usage | GPU utilization | %
GPU Usage [Avg] | Overall average GPU utilization | %
GPU Power Cap | Maximum power capacity of the GPU | W
GPU Power Usage | Current GPU power usage | W
GPU Memory Usage [Avg] | Average GPU memory utilization | %
GPU Count in use | Number of GPUs in use by jobs on the node | cnt
Execution Status for nvidia-smi | Result of running the nvidia-smi command | status
Core Usage [IO Wait] | Proportion of CPU time spent in a waiting state (disk wait) | %
Core Usage [System] | Proportion of CPU time spent in kernel space | %
Core Usage [User] | Proportion of CPU time spent in user space | %
CPU Cores | Number of CPU cores on the host | cnt
CPU Usage [Active] | Percentage of CPU time used, excluding Idle and IOWait states | %
CPU Usage [Idle] | Proportion of CPU time spent in the idle state | %
CPU Usage [IO Wait] | Proportion of CPU time spent in a waiting state (disk wait) | %
CPU Usage [System] | Percentage of CPU time used by the kernel | %
CPU Usage [User] | Percentage of CPU time used in user space | %
CPU Usage/Core [Active] | Percentage of CPU time used, excluding Idle and IOWait states | %
CPU Usage/Core [Idle] | Proportion of CPU time spent in the idle state | %
CPU Usage/Core [IO Wait] | Proportion of CPU time spent in a waiting state (disk wait) | %
CPU Usage/Core [System] | Percentage of CPU time used by the kernel | %
CPU Usage/Core [User] | Percentage of CPU time used in user space | %
Disk CPU Usage [IO Request] | Proportion of CPU time during which I/O requests to the device were executed | %
Disk Queue Size [Avg] | Average queue length of requests executed for the device | num
Disk Read Bytes | Number of bytes read per second from the device | bytes
Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta for individual disks | bytes
Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta for individual disks | bytes
Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta for individual disks | bytes
Disk Read Bytes [Delta Sum] | Sum of system.diskio.read.bytes_delta for individual disks | bytes
Disk Read Bytes [Delta] | Delta of the system.diskio.read.bytes value for each disk | bytes
Disk Read Bytes [Success] | Total number of bytes successfully read | bytes
Disk Read Requests | Number of read requests to the disk device per second | cnt
Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta for individual disks | cnt
Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta for individual disks | cnt
Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta for individual disks | cnt
Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta for individual disks | cnt
Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt
Disk Read Requests [Success] | Total number of successful reads | cnt
Disk Request Size [Avg] | Average size of requests executed on the device (unit: sectors) | num
Disk Service Time [Avg] | Average service time (milliseconds) of I/O requests executed on the device | ms
Disk Wait Time [Avg] | Average time taken for requests issued to the device to be served | ms
Disk Wait Time [Read] | Average disk read wait time | ms
Disk Wait Time [Write] | Average disk write wait time | ms
Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta for individual disks | bytes
Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta for individual disks | bytes
Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta for individual disks | bytes
Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta for individual disks | bytes
Disk Write Bytes [Delta] | Delta of the system.diskio.write.bytes value for each disk | bytes
Disk Write Bytes [Success] | Total number of bytes successfully written | bytes
Disk Write Requests | Number of write requests to the disk device per second | cnt
Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta for individual disks | cnt
Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta for individual disks | cnt
Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta for individual disks | cnt
Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta for individual disks | cnt
Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt
Disk Write Requests [Success] | Total number of successful writes | cnt
Disk Writes Bytes | Number of bytes written per second to the device | bytes
Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status
Filesystem Nodes | Total number of file nodes in the file system | cnt
Filesystem Nodes [Free] | Total number of available file nodes in the file system | cnt
Filesystem Size [Available] | Disk space (bytes) available to unprivileged users | bytes
Filesystem Size [Free] | Available disk space (bytes) | bytes
Filesystem Size [Total] | Total disk space (bytes) | bytes
Filesystem Usage | Percentage of disk space used | %
Filesystem Usage [Avg] | Average of individual filesystem.used.pct values | %
Filesystem Usage [Inode] | Inode usage rate | %
Filesystem Usage [Max] | Maximum of individual filesystem.used.pct values | %
Filesystem Usage [Min] | Minimum of individual filesystem.used.pct values | %
Filesystem Usage [Total] | - | %
Filesystem Used | Used disk space (bytes) | bytes
Filesystem Used [Inode] | Inode usage | bytes
Memory Free | Total amount of available memory (bytes) | bytes
Memory Free [Actual] | Actual usable memory (bytes) | bytes
Memory Free [Swap] | Available swap memory | bytes
Memory Total | Total memory | bytes
Memory Total [Swap] | Total swap memory | bytes
Memory Usage | Percentage of used memory | %
Memory Usage [Actual] | Percentage of memory actually used | %
Memory Usage [Cache Swap] | Cached swap usage rate | %
Memory Usage [Swap] | Percentage of used swap memory | %
Memory Used | Used memory | bytes
Memory Used [Actual] | Actual memory used (bytes) | bytes
Memory Used [Swap] | Swap memory used | bytes
Collisions | Network collisions | cnt
Network In Bytes | Number of received bytes | bytes
Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta for individual networks | bytes
Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta for individual networks | bytes
Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta for individual networks | bytes
Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta for individual networks | bytes
Network In Bytes [Delta] | Delta of received byte count | bytes
Network In Dropped | Number of dropped packets among incoming packets | cnt
Network In Errors | Number of errors during reception | cnt
Network In Packets | Number of received packets | cnt
Network In Packets [Delta Avg] | Average of system.network.in.packets_delta for individual networks | cnt
Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta for individual networks | cnt
Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta for individual networks | cnt
Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta for individual networks | cnt
Network In Packets [Delta] | Delta of received packet count | cnt
Network Out Bytes | Number of transmitted bytes | bytes
Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta for individual networks | bytes
Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta for individual networks | bytes
Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta for individual networks | bytes
Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta for individual networks | bytes
Network Out Bytes [Delta] | Delta of transmitted byte count | bytes
Network Out Dropped | Number of dropped packets among outgoing packets | cnt
Network Out Errors | Number of errors during transmission | cnt
Network Out Packets | Number of transmitted packets | cnt
Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta for individual networks | cnt
Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta for individual networks | cnt
Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta for individual networks | cnt
Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta for individual networks | cnt
Network Out Packets [Delta] | Delta of transmitted packet count | cnt
Open Connections [TCP] | All open TCP connections | cnt
Open Connections [UDP] | All open UDP connections | cnt
Port Usage | Available port usage rate | %
SYN Sent Sockets | Number of sockets in SYN_SENT state (when connecting from local to remote) | cnt
Kernel PID Max | kernel.pid_max value | cnt
Kernel Thread Max | kernel.threads-max value | cnt
Process CPU Usage | Percentage of CPU time consumed by the process since the last update | %
Process CPU Usage/Core | Percentage of CPU time used by the process since the last event | %
Process Memory Usage | Proportion of main memory (RAM) occupied by a process | %
Process Memory Used | Resident Set Size: the amount of memory a process occupies in RAM | bytes
Process PID | Process PID | PID
Process PPID | Parent process PID | PID
Processes [Dead] | Number of dead processes | cnt
Processes [Idle] | Number of idle processes | cnt
Processes [Running] | Number of running processes | cnt
Processes [Sleeping] | Number of sleeping processes | cnt
Processes [Stopped] | Number of stopped processes | cnt
Processes [Total] | Total number of processes | cnt
Processes [Unknown] | Number of processes whose status cannot be determined or is unknown | cnt
Processes [Zombie] | Number of zombie processes | cnt
Running Process Usage | Process usage rate | %
Running Processes | Number of running processes | cnt
Running Thread Usage | Thread usage rate | %
Running Threads | Total number of threads running in running processes | cnt
Context Switches | Context switch count (per second) | cnt
Load/Core [1 min] | Load over the last 1 minute divided by the number of cores | cnt
Load/Core [15 min] | Load over the last 15 minutes divided by the number of cores | cnt
Load/Core [5 min] | Load over the last 5 minutes divided by the number of cores | cnt
Multipaths [Active] | Number of external storage connection paths with status active | cnt
Multipaths [Failed] | Number of external storage connection paths with status failed | cnt
Multipaths [Faulty] | Number of external storage connection paths with status faulty | cnt
NTP Offset | Measured offset of the last sample (the time difference between the NTP server and the local environment) | num
Run Queue Length | Run queue length | num
Uptime | OS uptime (milliseconds) | ms
Context Switchies | CPU context switch count (per second) | cnt
Disk Read Bytes [Sec] | Number of bytes read from a Windows logical disk in 1 second | cnt
Disk Read Time [Avg] | Average data read time (seconds) | sec
Disk Transfer Time [Avg] | Average disk wait time | sec
Disk Usage | Disk usage | %
Disk Write Bytes [Sec] | Number of bytes written to a Windows logical disk in 1 second | cnt
Disk Write Time [Avg] | Average data write time (seconds) | sec
Pagingfile Usage | Paging file usage | %
Pool Used [Non Paged] | Nonpaged Pool usage in kernel memory | bytes
Pool Used [Paged] | Paged Pool usage in kernel memory | bytes
Process [Running] | Number of currently running processes | cnt
Threads [Running] | Number of currently running threads | cnt
Threads [Waiting] | Number of threads waiting for processor time | cnt
Table. Additional monitoring metrics for GPU Server (Agent installation required)
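
To make the GPU rows of the table more concrete, the sketch below derives figures similar to GPU Count, GPU Usage [Avg], and GPU Memory Usage [Avg] from `nvidia-smi` CSV output. It does not describe how the Agent actually collects these metrics, and the mapping to the metric names is an assumption. The sample text is hypothetical, so the code runs without a GPU.

```python
import csv
import io

# Hypothetical output of the following command on a two-GPU server:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
SAMPLE = """0, 90, 40000, 80000
1, 30, 20000, 80000
"""


def summarize(text: str) -> dict:
    """Derive GPU count and average utilization figures from the CSV text."""
    rows = [[c.strip() for c in r] for r in csv.reader(io.StringIO(text)) if r]
    if not rows:
        raise ValueError("no GPU rows found")
    usage = [float(r[1]) for r in rows]  # utilization.gpu, in percent
    mem_pct = [100 * float(r[2]) / float(r[3]) for r in rows]  # used / total
    return {
        "gpu_count": len(rows),
        "gpu_usage_avg": sum(usage) / len(usage),
        "gpu_memory_usage_avg": sum(mem_pct) / len(mem_pct),
    }


print(summarize(SAMPLE))
```

On a real server, the captured standard output of the command shown in the comment could be passed to `summarize` in place of `SAMPLE`.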

1.3 - ServiceWatch Metrics

GPU Server sends metrics to ServiceWatch. Metrics provided by basic monitoring are collected at 5-minute intervals. If detailed monitoring is enabled, you can view data collected at 1-minute intervals.

Information
  • Basic and detailed monitoring for GPU Server provide the same metrics as Virtual Server, and the metrics are published under the Virtual Server namespace.
  • GPU-related metrics are provided through the ServiceWatch Agent. For instructions on collecting metrics with the ServiceWatch Agent, refer to the ServiceWatch Agent guide.
Reference
For how to view metrics in ServiceWatch, refer to the ServiceWatch guide.

To enable detailed monitoring of the GPU Server, refer to How-to guides > ServiceWatch Detailed Monitoring Activation.

Basic Metrics

The following are the basic metrics for the Virtual Server namespace.

The metrics whose names are displayed in bold below are the key metrics selected from the basic metrics provided by Virtual Server. Key metrics are used to build the service dashboards that ServiceWatch creates automatically for each service.

For each metric, the user guide indicates which statistics are meaningful to view. Among those, the statistics shown in bold are the primary statistics. In the service dashboard, you can view key metrics using their primary statistics.

Performance item | Detailed description | Unit | Meaningful statistics
Instance State | Instance status display (1 - Active, 0 - Off) | None | Total
CPU Usage | CPU usage | Percent | Average, Maximum, Minimum
Disk Read Bytes | Bytes read from block device (bytes) | Bytes | Total, Average, Maximum, Minimum
Disk Read Requests | Number of read requests on a block device | Count | Total, Average, Maximum, Minimum
Disk Write Bytes | Bytes written to block device (bytes) | Bytes | Total, Average, Maximum, Minimum
Disk Write Requests | Number of write requests on block device | Count | Total, Average, Maximum, Minimum
Network In Bytes | Bytes received on the network interface | Bytes | Total, Average, Maximum, Minimum
Network In Dropped | Number of received packets dropped on the network interface | Count | Total, Average, Maximum, Minimum
Network In Packets | Number of packets received on the network interface | Count | Total, Average, Maximum, Minimum
Network Out Bytes | Bytes transmitted on the network interface | Bytes | Total, Average, Maximum, Minimum
Network Out Dropped | Number of transmitted packets dropped on the network interface | Count | Total, Average, Maximum, Minimum
Network Out Packets | Number of packets transmitted on the network interface | Count | Total, Average, Maximum, Minimum
Table. Virtual Server Basic Metrics
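
The statistic names in the table (Total, Average, Maximum, Minimum) can be illustrated by grouping 1-minute samples into 5-minute periods. This guide does not describe how ServiceWatch aggregates data internally, so the sketch below only shows what the four statistics mean. The sample values and the `aggregate` helper are hypothetical.

```python
from statistics import mean


def aggregate(samples: list, period: int) -> list:
    """Group consecutive samples into periods and compute four statistics."""
    out = []
    for start in range(0, len(samples), period):
        chunk = samples[start:start + period]
        out.append({
            "Total": sum(chunk),
            "Average": mean(chunk),
            "Maximum": max(chunk),
            "Minimum": min(chunk),
        })
    return out


# Hypothetical 1-minute "Network In Bytes" samples covering ten minutes.
one_minute = [100, 200, 300, 400, 500, 50, 50, 50, 50, 50]
for stats in aggregate(one_minute, period=5):
    print(stats)
```

For a byte counter such as Network In Bytes, Total gives the volume transferred in the period, while Maximum and Minimum reveal bursts that Average hides.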

2 - How-to guides

Users can create the service by entering the required GPU Server information and selecting detailed options through the Samsung Cloud Platform Console.

Creating a GPU Server

You can create and use a GPU Server service from the Samsung Cloud Platform Console.

To create a GPU server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.

  2. On the Service Home page, click the GPU Server Creation button. You will be taken to the GPU Server Creation page.

  3. On the GPU Server Creation page, enter the information required to create the service and select detailed options.

    • Select the required information in the Image and Version Selection area.
      Category | Required | Detailed description
      Image | Required | Select the type of Image provided
      • Standard: Image provided as standard by Samsung Cloud Platform
        • RHEL, Ubuntu
      • Custom: User-created Image
      • Kubernetes: Image for Kubernetes
        • Ubuntu
      Image version | Required | Select the version of the chosen Image
      • Provides a list of versions for the offered server Image
      Table. GPU Server image and version selection input fields
    • In the Service Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Number of servers | Required | Number of GPU Servers to create at the same time
      • Only numeric input is allowed; enter a value between 1 and 100
      Service Type > Server Type | Required | GPU Server server type
      • Indicates the specifications of a GPU-type server; you can select servers with 1, 2, 4, or 8 GPUs
      Service Type > Planned Compute | Optional | Status of resources with Planned Compute configured
      • In Use: Number of resources with Planned Compute that are currently in use
      • Configured: Number of resources with Planned Compute configured
      • Coverage Preview: Amount applied per resource by Planned Compute
      • Apply for Planned Compute Service: Navigate to the Planned Compute service application page
      Block Storage | Required | Configure the Block Storage used by the GPU Server according to its purpose
      • Basic: The area where the OS is installed and used
        • Capacity can be entered in units (the minimum capacity varies depending on the OS image type)
          • RHEL: Values between 3 and 1,536 can be entered
          • Ubuntu: Values between 3 and 1,536 can be entered
        • SSD: High-performance general volume
        • HDD: General volume
        • SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys
          • Encryption can be applied only at initial creation (cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
        • SSD_Provisioned: SSD volume with configurable IOPS and throughput
      • Additional: Used when the user needs extra space beyond the OS area
        • After selecting Use, enter the storage type and capacity
        • To add storage, click the + button (up to 25 can be added); to delete, click the x button
        • Capacity can be entered in units, with values between 1 and 1,536
          • Since 1 Unit equals 8 GB, this creates 8 GB to 12,288 GB
        • SSD: High-performance general volume
        • HDD: General volume
        • SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys
          • Encryption can be applied only at initial creation (cannot be changed after creation)
          • Performance degradation may occur when using the SSD_KMS disk type
        • HDD/SSD_MultiAttach: Volume that can be attached to two or more servers
        • SSD_Provisioned: SSD volume with configurable IOPS and throughput
      • Delete on termination: If Delete on termination is set to Enabled, the volume is deleted together with the server
        • Volumes with existing snapshots are not deleted even if Delete on termination is set to Enabled
        • A multi-attach volume can be deleted only when the server being removed is the last remaining server attached to the volume
      Max IOPS | Required | Enter a maximum IOPS value between 5,000 and 20,000
      • Can be set only when the disk type is SSD_Provisioned
      Max Throughput | Required | Enter a maximum Throughput value between 250 and 1,000
      • Can be set only when the disk type is SSD_Provisioned
      Table. GPU Server Service Configuration Items
    • In the Required Information Input area, enter or select the necessary information.
      Category
      required or not
      Detailed description
      Server nameRequiredEnter a name to distinguish the server when the selected number of servers is 1
      • Set the hostname using the entered server name
      • Enter within 63 characters using letters, numbers, spaces, and special characters (- _)
      Server name PrefixRequiredEnter a Prefix to distinguish each server generated when the selected number of servers is 2 or more
      • Automatically generated in the form of user input value (prefix) + ‘-#
      • Enter within 59 characters using letters, numbers, spaces, and special characters (-, _)
      Network Settings > Create New Network PortRequiredConfigure the network where the GPU Server will be installed
      • Select a pre‑created VPC.
      • General Subnet: Select a pre‑created general Subnet
        • IP can be set to auto‑generate or manual entry; if manual is chosen, the user can directly input the IP
        • NAT: Available only when there is a single server and the VPC is attached to an Internet Gateway. Checking Use enables selection of a NAT IP
        • NAT IP: Select a NAT IP
          • If no NAT IP is available, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
          • Creating a Public IP incurs charges according to the Public IP pricing policy
      • Local Subnet (optional): Choose Use for a local Subnet
        • It is not a required element for creating the service
        • A pre‑created local Subnet must be selected
        • IP can be set to auto‑generate or manual entry; selecting Manual allows the user to input the IP directly
        • Security Group: Settings required to access the server
          • Select: Choose a pre‑created Security Group
          • Create New: If no applicable Security Group exists, you can create one separately in the Security Group service
          • Up to 5 can be selected
          • If no Security Group is set, all connections are blocked by default
          • You must configure a Security Group to allow required connections
      Network Settings > Existing Network Port AssignmentRequiredSet the network where the GPU server will be installed
      • Select a pre-created VPC
      • General Subnet: Select a pre-created general Subnet and Port
        • NAT: Available only when there is a single server and the VPC is connected to an Internet Gateway. When you check to use it, you can select a NAT IP.
        • NAT IP: Select a NAT IP
          • If there is no NAT IP to select, click the Create New button to generate a Public IP
          • Click the Refresh button to view and select the created Public IP
      • Local Subnet (optional): Select Use for the local Subnet
        • Select a pre-created local Subnet and Port
      Keypair | Required | User authentication method to use when connecting to the server
      • New creation: Create a new keypair if one is needed
      • Default login accounts by OS
        • RHEL: cloud-user
        • Ubuntu: ubuntu
      Table. Required input fields for GPU Server
    • Enter or select the required information in the Additional Information Input area.
      Category | Required | Detailed description
      Lock | Optional | Lock usage setting
      • When Lock is enabled, actions such as server termination, start, and stop are blocked, which prevents accidental operations
      Init script | Optional | Script executed when the server starts
      • Depending on the image type, write the init script as a Batch script for Windows, or as a Shell script or cloud-init for Linux
      • Up to 45,000 bytes can be entered
      Tag | Optional | Add tags
      • Up to 50 tags can be added per resource
      • After clicking the Add Tag button, enter or select the Key and Value
      Table. GPU Server additional information input fields
  4. In the Summary panel, check the detailed information and estimated billing amount, and click the Create button.

    • Once creation is complete, check the created resources on the GPU Server List page.
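
The Init script limit from the Additional Information Input area can be checked locally before pasting a script into the Console. The following is a minimal sketch, not a platform tool: the function name `check_init_script` is hypothetical, and it assumes the 45,000-byte limit is counted on the UTF-8 encoding and that a Linux script begins with a shebang or a `#cloud-config` header.

```python
MAX_INIT_SCRIPT_BYTES = 45_000  # limit stated in the additional information table


def check_init_script(script: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # Assumption: the limit is measured in UTF-8 bytes, not characters.
    size = len(script.encode("utf-8"))
    if size > MAX_INIT_SCRIPT_BYTES:
        problems.append(f"script is {size} bytes; limit is {MAX_INIT_SCRIPT_BYTES}")
    first_line = script.lstrip().splitlines()[0] if script.strip() else ""
    # Assumption: Linux images expect a shell shebang or a cloud-init header.
    if not (first_line.startswith("#!") or first_line == "#cloud-config"):
        problems.append("Linux init script should start with '#!' or '#cloud-config'")
    return problems


example = "#!/bin/bash\nnvidia-smi > /var/log/gpu-check.log 2>&1\n"
print(check_init_script(example))  # prints []
```

Windows Batch scripts have no such header, so the second check would need to be skipped for Windows images.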

Check GPU Server detailed information

The GPU Server service lets you view and edit the full resource list and detailed information. The GPU Server Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Server service, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource you want to view details for. You will be taken to the GPU Server Details page.
    • The GPU Server Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
    • For details on the additional features, refer to GPU Server Management Additional Features.
      Category | Detailed description
      GPU Server status | Status of the user-created GPU Server
      • Build: State where the Build command has been received
      • Building: Build in progress
      • Networking: Server creation networking process
      • Scheduling: Server creation scheduling process
      • Block_Device_Mapping: Connecting Block Storage during server creation
      • Spawning: Server creation process is ongoing
      • Active: Available state
      • Powering_off: State when a stop request is made
      • Deleting: Server deletion in progress
      • Reboot_Started: Reboot in progress
      • Error: Error state
      • Migrating: Server migrating to another host
      • Reboot: State where the Reboot command has been received
      • Rebooting: Rebooting
      • Rebuild: State where the Rebuild command has been received
      • Rebuilding: State during Rebuild request
      • Rebuild_Spawning: Rebuild process is ongoing
      • Resize: State where the Resize command has been received
      • Resizing: Resizing in progress
      • Resize_Prep: State when a server type change is requested
      • Resize_Migrating: Server is migrating to another host while resizing
      • Resize_Migrated: Server has completed migration to another host during resizing
      • Resize_Finish: Resize completed
      • Revert_Resize: Resize or migration of the server failed for some reason. The target server is cleaned up and the original server is restarted
      • Shutoff: State when powering off is completed
      • Verify_Resize: After Resize_Prep following a server type change request, the server type change can be confirmed or reverted
      • Resize_Reverting: State when a server type revert is requested
      • Resize_Confirming: State confirming the server’s Resize request
      Server control | Buttons to change the server status
      • Start: start a stopped server
      • Stop: stop a running server
      • Restart: restart a running server
      Create Image | Create a custom user image from the current server image
      Console log | View the console logs of the current server
      • You can view the console logs output by the current server. For more details, see Check console log.
      Create dump | Generate a dump of the current server
      • The dump file is created inside the GPU Server
      • For detailed dump creation instructions, refer to Create Dump
      Rebuild | Deletes all data and settings of the existing server and configures a new server
      Service termination | Cancel Service button
      Table. GPU Server status information and additional features
Notice
When using the MIG feature, recheck the MIG settings after the GPU Server has finished rebooting.
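
The relationship between the server status and the server control buttons can be summarized in a small lookup. This is an illustrative sketch only: `allowed_controls` is a hypothetical helper, and it assumes that no control applies in any status other than Active and Shutoff.

```python
# Controls keyed by the status in which this guide says each one applies:
# Start for a stopped (Shutoff) server, Stop and Restart for a running (Active) one.
CONTROLS_BY_STATUS = {
    "Shutoff": {"Start"},
    "Active": {"Stop", "Restart"},
}


def allowed_controls(status: str) -> set[str]:
    """Return the server-control buttons that apply in the given status."""
    # Assumption: transitional and error statuses accept no control action.
    return set(CONTROLS_BY_STATUS.get(status, ()))


print(allowed_controls("Active"))
print(allowed_controls("Rebooting"))
```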

Detailed Information

On the GPU Server List page, you can view detailed information of the selected resource and edit it if needed.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In the GPU Server service, it refers to the GPU Server SRN
Resource name | Resource name
  • In the GPU Server service, it refers to the GPU Server name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Editor | User who edited the service information
Modification date | Date and time the service information was modified
Server name | Server name
Server type | vCPU, memory, and GPU information
  • To change to a different server type, click the Edit button
Image name | Service OS image and version
Lock | Indicates whether Lock is in use
  • To change the Lock setting, click the Edit button
Keypair name | Server authentication information set by the user
Planned Compute | Resource status with Planned Compute configured
LLM Endpoint | URL for using the LLM
ServiceWatch Detailed Monitoring | When enabled, detailed monitoring data is available in the ServiceWatch service
  • To enable or disable it, click the Edit button
Network | Network information of the GPU Server
  • VPC name, general Subnet name, IP, NAT IP, NAT IP status, Security Group name
  • To change the NAT IP, click the Edit button
  • To change the Security Group, click the Edit button
  • Add as new network: select a general Subnet and IP
    • You can select a different general Subnet within the same VPC
    • The IP can be auto-generated or entered manually; if you choose manual entry, you can input the IP directly
  • Add using existing port: select a pre-created general Subnet and port
Local Subnet | Local Subnet information of the GPU Server
  • Local Subnet name, Local Subnet IP, Security Group name
  • To change the Security Group, click the Edit button
  • Add as new network: select a local Subnet and IP
    • You can select a different local Subnet within the same VPC
    • The IP can be auto-generated or entered manually; if you choose manual entry, you can input the IP directly
  • Add using existing port: select a pre-created local Subnet and port
Block Storage | Information about the Block Storage attached to the server
  • Volume ID, Volume Name, Type, Capacity, Connection Info, Category, Delete on termination, Status
  • Add: Attach additional Block Storage if needed
  • Edit Delete on termination: Modify the Delete on termination value
  • More > Detach: Detach the selected Block Storage from the list
Table. GPU Server detailed information tab items
Caution
When using ServiceWatch detailed monitoring, additional fees apply.
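
When adding a network with a manually entered IP, it can help to confirm beforehand that the address belongs to the chosen Subnet. The sketch below uses Python's standard `ipaddress` module; the function name and the example CIDR are hypothetical, and it assumes only the network and broadcast addresses are unusable (the platform may reserve further addresses).

```python
import ipaddress


def is_usable_manual_ip(ip: str, subnet_cidr: str) -> bool:
    """Check that a manually entered IP lies inside the given Subnet range."""
    network = ipaddress.ip_network(subnet_cidr, strict=True)
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP literal
    if address not in network:
        return False
    # Assumption: only the network and broadcast addresses are excluded.
    return address not in (network.network_address, network.broadcast_address)


print(is_usable_manual_ip("192.168.10.25", "192.168.10.0/24"))
```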

Tags

On the GPU Server List page, you can view the tag information of the selected resource, and add, modify, or delete tags.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. GPU Server Tag Tab Items

Job History

You can view the job history of the selected resource on the GPU Server List page.

Category | Detailed description
Job History List | Resource change history
  • Operation Time, Resource ID, Resource Name, Operation Details, Event Topic, Operation Result, Operator Information
Table. Work History Tab Detailed Information Items

Control GPU Server Operation

If you need to control the operation of a created GPU Server resource, you can do so on the GPU Server List or GPU Server Details page. You can start a stopped server, and stop or restart a running server.

Start GPU Server

You can start a GPU Server that is in the Shutoff state. To start a GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of the GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource you want to start among the servers in the Shutoff state. You will be taken to the GPU Server Details page.
    • On the GPU Server List page, you can also start each resource via the More button on the right.
    • After selecting multiple servers with checkboxes, you can control them simultaneously using the Start button at the top.
  4. On the GPU Server Details page, click the Start button at the top to start the server. In the Status Display section, verify the updated server status.
    • When the GPU Server startup is complete, the server status changes from Shutoff to Active.
    • For detailed information about the GPU Server status, refer to GPU Server detailed information.

Stop GPU Server

You can stop a running (Active) GPU Server. To stop the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to stop among the servers that are running (Active), and navigate to the GPU Server Details page.
    • On the GPU Server List page, you can Stop each resource via the right More button.
    • After selecting multiple servers with checkboxes, you can control multiple servers simultaneously using the Stop button at the top.
  4. On the GPU Server Details page, click the Stop button at the top to stop the server. Check the changed server status in the Status Display item.
    • When the GPU server shutdown is complete, the server status changes from Active to Shutoff.
    • For detailed information about the GPU Server status, please refer to GPU Server Detailed Information.

Restart GPU Server

You can restart the created GPU Server. To restart the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to restart and go to the GPU Server Details page.
    • On the GPU Server list page, you can restart each resource using the right More button.
    • After selecting multiple servers with checkboxes, you can control multiple servers simultaneously using the Restart button at the top.
  4. On the GPU Server Details page, click the Restart button at the top to restart the server. Check the updated server status in the Status Display field.
    • During a GPU Server restart, the server status goes through Rebooting and finally changes to Active.
    • For detailed information about the GPU Server status, refer to GPU Server detailed information.
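
The status transitions described for start, stop, and restart (for example, Rebooting followed by Active) lend themselves to a simple polling loop when scripted. This is a generic sketch rather than a platform API: `get_status` is a hypothetical callable that you would supply yourself, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], target: str = "Active",
                    timeout_s: float = 600.0, interval_s: float = 5.0) -> str:
    """Poll get_status() until it returns the target status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status == "Error":
            # The Error status is listed in the status table; stop waiting on it.
            raise RuntimeError("server entered the Error status")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)


# Usage with a canned sequence standing in for real status lookups.
statuses = iter(["Reboot", "Rebooting", "Active"])
print(wait_for_status(lambda: next(statuses), interval_s=0))  # prints Active
```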

GPU Server resource management

If you need server control and management functions for the created GPU Server resources, you can perform the tasks on the GPU Server Resource List or GPU Server Details page.

Create Image

You can create an image of a running GPU server.

Reference

This guide explains how to create a custom user image from the image of a running GPU server.

  • On the GPU Server List or GPU Server Details page, click the Create Image button to create a user Custom Image.

To create an image of the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Navigate to the Service Home page of the GPU Server.

  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.

  3. On the GPU Server List page, click the resource to create an Image. You will be taken to the GPU Server Details page.

  4. Click the Create Image button on the GPU Server Details page. You will be taken to the Image Creation page.

    • Enter the required information in the Service Information Input area.
      Category | Required | Detailed description
      Image name | Required | Enter the name of the image to be created
      • Use English letters, numbers, spaces, and special characters (- _), within 200 characters
      Table. Image service information input fields
  5. Check the input information and click the Create button.

    • When creation is complete, check the created resources on the All Services > Compute > GPU Server > Image List page.
Notice
  • When you create an Image, the generated Image is stored in the Object Storage used as internal storage. Therefore, Object Storage usage fees are applied.
  • Since the file system of an image created from an active GPU server cannot be guaranteed to be intact, it is recommended to stop the server before creating the image.
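
The image name rule from the input table (English letters, numbers, spaces, and the special characters - and _, within 200 characters) can be expressed as a regular expression. A minimal sketch follows; the helper name is hypothetical, and it assumes an empty name is not accepted.

```python
import re

# Letters, digits, space, underscore, and hyphen; 1 to 200 characters.
IMAGE_NAME_PATTERN = re.compile(r"[A-Za-z0-9 _-]{1,200}")


def is_valid_image_name(name: str) -> bool:
    """Return True if the name satisfies the documented character and length rule."""
    return IMAGE_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_image_name("gpu-train_img v2"))  # prints True
print(is_valid_image_name("img!"))              # prints False
```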

Enabling detailed monitoring for ServiceWatch

By default, the GPU Server is linked to basic monitoring in ServiceWatch under the Virtual Server namespace. You can enable detailed monitoring as needed to identify and act on operational issues more quickly. For detailed information about ServiceWatch, see ServiceWatch Overview.

Note
The GPU Server provides basic and detailed monitoring in the same namespace as the Virtual Server. GPU Server’s GPU metrics will be provided by the ServiceWatch Agent. (Planned for December 2025)
Caution
Basic monitoring is provided free of charge, but enabling detailed monitoring incurs additional fees. Please be aware when using it.

To enable detailed ServiceWatch monitoring on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to enable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup window, select Enable, review the instructions, and click the Confirm button.
  6. On the GPU Server Details page, check the ServiceWatch detailed monitoring items.

Disable detailed monitoring of ServiceWatch

Caution
For cost efficiency, disable detailed monitoring when it is not needed. Keep it enabled only on the servers that require it.

To disable detailed monitoring of ServiceWatch on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource for which to disable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
  5. In the ServiceWatch Detailed Monitoring Edit popup window, deselect Enable, review the guidance message, and click the Confirm button.
  6. On the GPU Server Details page, view the ServiceWatch detailed monitoring items.

GPU Server Management Additional Features

For server management, you can view console logs, create dumps, and perform rebuilds on a GPU Server. Follow the steps in each section below.

Check console log

You can view the current console log of the GPU Server.

To view the console logs of the GPU server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the GPU Server’s Service Home page.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to view the console log. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Console Log button. You will be taken to the Console Log popup.
  5. In the Console Log popup window, check the displayed console log.

Create Dump

To create a dump file on the GPU Server, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource for which to create a dump. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Create Dump button.
    • The dump file is created inside the GPU server.

Perform Rebuild

You can delete all data and settings of the existing GPU Server and rebuild it on a new server.

Follow these steps to perform a Rebuild of the GPU Server.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, click the resource to rebuild. You will be taken to the GPU Server Details page.
  4. On the GPU Server Details page, click the Rebuild button.
    • During a GPU Server Rebuild, the server status changes to Rebuilding, and when the Rebuild is complete, it returns to its state before the Rebuild.
    • For detailed information about the GPU Server status, refer to Check GPU Server details.

Terminate GPU Server

If you terminate an unused GPU Server, you can reduce operating costs. However, terminating a GPU Server may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with the termination.

Caution
Please note that data cannot be recovered after terminating the service.

To terminate the GPU Server, follow the steps below.

  1. Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
  2. On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
  3. On the GPU Server List page, select the resource to terminate and click the Cancel Service button.
    • The termination of attached storage depends on the Delete on termination setting, so refer to Termination constraints.
  4. After termination is complete, check on the GPU Server List page whether the resources have been terminated.

Termination constraints

When a GPU Server termination request cannot be processed, a popup provides guidance. Refer to the cases below.

Termination not possible
  • If File Storage is connected, disconnect the File Storage first.
  • If an LB Pool is connected, disconnect the LB Pool first.
  • If Lock is set, change the Lock setting to disabled and try again.

Termination of attached storage depends on the Delete on termination setting.

Deletion behavior by Delete on termination setting
  • Whether a volume is deleted depends on its Delete on termination setting.
    • When Delete on termination is not set: the volume is not deleted even if you terminate the GPU Server.
    • When Delete on termination is set: the associated volume is deleted when you terminate the GPU Server.
  • Volumes that have snapshots will not be deleted even if Delete on termination is set.
  • A Multi‑attach volume can be deleted only when the server being deleted is the last remaining server attached to the volume.
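
The termination constraints and volume deletion rules above can be read as two small decision functions. The sketch below is purely illustrative: the `Server` and `Volume` classes and their field names are hypothetical stand-ins, not objects exposed by the platform.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    server_id: str
    lock_enabled: bool = False
    file_storage_connected: bool = False
    lb_pool_connected: bool = False


@dataclass
class Volume:
    delete_on_termination: bool
    has_snapshots: bool = False
    # More than one entry models a Multi-attach volume.
    attached_server_ids: list[str] = field(default_factory=list)


def termination_blockers(server: Server) -> list[str]:
    """List the documented reasons why termination cannot proceed."""
    blockers = []
    if server.file_storage_connected:
        blockers.append("disconnect File Storage first")
    if server.lb_pool_connected:
        blockers.append("disconnect the LB Pool first")
    if server.lock_enabled:
        blockers.append("disable Lock and try again")
    return blockers


def volume_is_deleted(volume: Volume, terminating_server_id: str) -> bool:
    """Decide whether a volume is removed when the given server is terminated."""
    if not volume.delete_on_termination:
        return False
    if volume.has_snapshots:
        return False  # volumes with snapshots are kept
    others = [s for s in volume.attached_server_ids if s != terminating_server_id]
    return not others  # deleted only if this was the last attached server


shared = Volume(delete_on_termination=True, attached_server_ids=["srv-a", "srv-b"])
print(volume_is_deleted(shared, "srv-a"))  # prints False
```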

2.1 - Manage Image

Users can create the service by entering the required information for the Image service within the GPU Server service and selecting detailed options through the Samsung Cloud Platform Console.

Create Image

You can create an image of a running GPU Server. To create an image of a GPU Server, refer to Create Image.

Check Image detailed information

The Image service lets you view and edit the full resource list and detailed information. The Image Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the Image service, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Image menu. You will be taken to the Image list page.
  3. On the Image List page, click the resource to view detailed information. You will be taken to the Image Detail page.
    • The Image Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.
      Category | Detailed description
      Image status | Status of the user-created Image
      • Active: Available state
      • Queued: Image has been uploaded and is waiting for processing after creation
      • Importing: Image has been uploaded and is currently being processed after creation
      Share with another Account | The Image can be shared with another Account
      • The Image’s Visibility must be set to Shared in order to be shared with another Account
      Delete image | Button to delete the Image
      • Once the Image is deleted, it cannot be restored
      Table. GPU Server Image status information and additional features

Detailed Information

On the Image List page, you can view detailed information of the selected resource and modify it if necessary.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • Refers to the SRN of a GPU Server Image
Resource name | Image name
Resource ID | Image ID
Creator | User who created the Image
Creation date and time | Image creation timestamp
Editor | User who edited the Image
Modification date | Image modification timestamp
Image name | Image name
Minimum disk | Minimum disk capacity (GB) of the Image
  • To modify the minimum disk, click the Edit button
Minimum RAM | Minimum RAM size (GB) of the Image
OS type | OS type of the Image
OS hash algorithm | OS hash algorithm method
Visibility | Access scope of the Image
  • Private can be used only within the project, and Shared can be shared across projects
Protected | Whether deletion of the Image is prohibited
  • Checking Use prevents accidental deletion of the Image
  • This setting can be changed after the Image is created
Image file URL | URL of the image file uploaded when the Image was created
  • Not displayed for Images created through the Create Image menu on the GPU Server Details page
Sharing status | Current status of sharing the Image with another Account
  • Approved Account ID: ID of the Account for which sharing is approved
  • Modification date and time: The date and time when sharing was requested to another Account; when the sharing status changes from Pending to Accepted, it is updated to that date and time
  • Status: Approval status
    • Accepted: sharing is approved and active
    • Pending: awaiting approval
  • Stop sharing: sharing has been stopped
  • To stop sharing, select the Account IDs from the list, then click the Stop sharing button at the top of the list to stop sharing for all of them at once
Table. Image detailed information tab items

Tags

On the Image List page, you can view the tag information of the selected resource and add, modify, or delete tags.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. Image tag tab items

Job History

You can view the operation history of the selected resource on the Image List page.

Category | Detailed description
Job History List | Resource change history
  • Operation Time, Resource ID, Resource Name, Operation Details, Event Topic, Operation Result, Operator Information
Table. GPU Server Image Job History Tab Detailed Information Items

Image Resource Management

Describes the control and management functions of the generated Image.

Share to another Account

To share an Image with another Account, follow the steps below.

  1. Log in to the Account that owns the Image to share, and click the All Services > Compute > GPU Server menu. You will be taken to the GPU Server’s Service Home page.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, click the Image you want to control. You will be taken to the Image Details page.
  4. Click the Share to another Account button. You will be taken to the Share image to another Account page.
    • To share an Image with another Account, the Image’s Visibility must be Shared.
  5. On the Share image to another Account page, enter the required information and click the Done button.
    Category | Required | Detailed description
    Image name | - | Name of the image to share
    • Input not allowed
    Image ID | - | ID of the image to share
    • Input not allowed
    Shared Account ID | Required | Enter the ID of the Account to share with
    • English letters, numbers, and the special character (-), within 64 characters
    Table. Required input fields for sharing images to another Account
  6. You can check the result in the Sharing status item of the Image Details page.
    • When the request is first made, the status is Pending, and it changes to Accepted once approval is completed by the account receiving the share.
Notice
Only Images that the current user created by uploading an image file can be shared with another Account. A Custom Image created from the image of a running GPU Server cannot currently be shared with another Account; this feature is planned for a future release.
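
The sharing preconditions described in this section (Visibility set to Shared, an Image created by file upload, and a well-formed Account ID) can be combined into one pre-check. This is a hedged sketch: `share_blockers` and its parameters are hypothetical names, and the Account ID pattern assumes at least one character is required.

```python
import re

# Letters, digits, and hyphen; 1 to 64 characters.
ACCOUNT_ID_PATTERN = re.compile(r"[A-Za-z0-9-]{1,64}")


def share_blockers(visibility: str, created_by_upload: bool, account_id: str) -> list[str]:
    """Return the reasons a share request would not meet the documented rules."""
    blockers = []
    if visibility != "Shared":
        blockers.append("Visibility must be Shared")
    if not created_by_upload:
        blockers.append("only Images created by uploading an image file can be shared")
    if not ACCOUNT_ID_PATTERN.fullmatch(account_id):
        blockers.append("Account ID must be 1-64 letters, digits, or '-'")
    return blockers


print(share_blockers("Shared", True, "team-ml-01"))  # prints []
```

After a request that passes these checks, the sharing status starts as Pending and changes to Accepted once the receiving Account approves it.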

Receive sharing from another Account

To receive an Image shared from another Account, follow these steps.

  1. Log in to the Account that will receive the shared Image, and click the All Services > Compute > GPU Server menu. You will be taken to the GPU Server’s Service Home page.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, click the Receive Image Share button. You will be taken to the Receive Image Share popup.
  4. In the Receive Image Share popup window, enter the resource ID of the Image you want to receive, and click the Confirm button.
  5. When image sharing is complete, you can view the shared Image in the Image list.

Delete Image

You can delete unused Images. However, since an Image cannot be recovered after deletion, you should carefully consider the impact before performing the deletion.

Caution
Please note that an Image cannot be recovered after it is deleted.

To delete the Image, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
  2. On the Service Home page, click the Image menu. You will be taken to the Image List page.
  3. On the Image List page, select the resource to delete and click the Delete button.
    • On the Image List page, select multiple Image check boxes and click the Delete button at the top of the resource list.
  4. After deletion is complete, verify on the Image list page that the resource has been removed.

2.2 - Manage Keypair

Users can create the service by entering the required Keypair information within the GPU Server service and selecting detailed options through the Samsung Cloud Platform Console.

Create a Keypair

You can create and use the Keypair service while using the GPU Server service in the Samsung Cloud Platform Console.

To create a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, click the Keypair Create button. You will be taken to the Keypair Create page.
    • Enter the required information in the Service Information Input area.
      Category | Required | Detailed description
      Keypair name | Required | Enter the name of the Keypair to create
      • Use English letters, numbers, spaces, and special characters (-, _), within 255 characters
      Keypair type | Required | ssh
      Table. Keypair service information input fields
    • In the Additional Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Tag | Optional | Add tags
      • Up to 50 tags can be added per resource
      • After clicking the Add Tag button, enter or select the Key and Value
      Table. Keypair additional information input fields
      Caution
      • After creation is complete, you can download the Key only once. It cannot be reissued, so make sure the download has completed.
      • Store the downloaded Private Key in a safe place.
  4. Check the input information and click the Create button.
    • After creation is complete, check the created resources on the Keypair List page.
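
Once the private key has been downloaded, a typical next step is to restrict its file permissions and connect with the default login account for the server's OS (cloud-user for RHEL, ubuntu for Ubuntu, as listed in the GPU Server creation table). The sketch below only builds the command; the helper names, host address, and key file name are hypothetical, and it assumes a standard OpenSSH client on a POSIX system.

```python
import os
import stat

# Default login accounts by OS, taken from the GPU Server creation table.
DEFAULT_LOGIN = {"RHEL": "cloud-user", "Ubuntu": "ubuntu"}


def restrict_private_key(path: str) -> None:
    """Make the key readable and writable by its owner only (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def build_ssh_command(os_name: str, host: str, key_path: str) -> list[str]:
    """Build an OpenSSH command line for key-based login."""
    user = DEFAULT_LOGIN[os_name]  # raises KeyError for an OS not listed above
    return ["ssh", "-i", key_path, f"{user}@{host}"]


print(build_ssh_command("Ubuntu", "203.0.113.10", "gpu-key.pem"))
```

The returned list can be passed to `subprocess.run` as-is; OpenSSH generally refuses private keys whose permissions are too open, which is why the permission step comes first.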

View detailed information of the Keypair

The Keypair service lets you view and edit the full resource list and detailed information. The Keypair Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, click the resource to view its details. You will be taken to the Keypair Details page.
    • The Keypair Details page displays status information and additional feature information, and consists of the Details, Tags, and Job History tabs.

Detailed Information

On the Keypair List page, you can view detailed information of the selected resource and edit it if needed.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In Keypair, it refers to the Keypair SRN
Resource name | Keypair name
Resource ID | Keypair’s unique resource ID
Creator | User who created the Keypair
Creation date and time | Keypair creation timestamp
Editor | User who modified the Keypair information
Modification date | Date and time the Keypair information was modified
Keypair name | Keypair name
Fingerprint | A unique value for identifying the key
User ID | User ID of the Keypair creator
Public key | Public key information
Table. Keypair detailed information tab items
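
A fingerprint of this kind is usually a hash of the public key data, which makes it possible to check locally that a key you hold matches the one registered. The sketch below computes the two formats commonly printed by OpenSSH tools; it assumes the public key is in the single-line OpenSSH format (`<type> <base64> [comment]`), and which format the Console displays is not stated in this guide.

```python
import base64
import hashlib


def ssh_fingerprints(public_key_line: str) -> dict[str, str]:
    """Compute MD5 and SHA256 fingerprints of an OpenSSH-format public key."""
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64-key> [comment]'")
    blob = base64.b64decode(parts[1], validate=True)
    # MD5 style: colon-separated hex pairs (used for identification, not security).
    md5_hex = hashlib.md5(blob, usedforsecurity=False).hexdigest()
    md5 = ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))
    # SHA256 style: base64 of the digest with trailing padding removed.
    sha256 = base64.b64encode(hashlib.sha256(blob).digest()).decode().rstrip("=")
    return {"md5": md5, "sha256": "SHA256:" + sha256}
```

Pass the contents of a `.pub` file (or the public key value copied from the Keypair Details page) and compare the result with the Fingerprint field.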

Tags

On the Keypair List page, you can view the tag information of the selected resource, and add, modify, or delete tags.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. Keypair Tag Tab Items

Activity Log

On the Keypair List page, you can view the activity log of the selected resource.

Category           Detailed description
Activity log list  Resource change history
                     • View the operation time, resource ID, resource name, operation details, event topic, operation result, and operator information
Table. Keypair Activity Log tab items

Keypair Resource Management

This section describes how to control and manage Keypairs.

Import public key

To import a public key, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.

  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.

  3. On the Keypair List page, click the More button at the top and then click the Import Public Key button. You will be taken to the Import Public Key page.

    • Enter or select the required information in the Required Information Input area.
      Category      Required  Detailed description
      Keypair name  Required  Name of the Keypair to create
      Keypair type  Required  ssh
      Public key    Required  Enter the public key
                                • Load file: Select the Attach file button to attach the public key file
                                  • Only files with the .pem extension can be attached
                                • Enter public key: Paste the copied public key value
                                  • The public key value can be copied from the Keypair Details page
      Table. Required input fields for importing a public key
  4. Review the entered information and click the Complete button.

    • Once creation is complete, check the created resources on the Keypair List page.

Delete Keypair

You can delete unused Keypairs. However, a deleted Keypair cannot be recovered, so review the impact thoroughly before proceeding with deletion.

Caution
Please note that data cannot be recovered after deleting the service.

To delete a keypair, follow these steps.

  1. Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
  2. On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
  3. On the Keypair List page, select the resource to delete, and click the Delete button.
    • To delete multiple Keypairs at once, select their check boxes and click the Delete button at the top of the resource list.
  4. After deletion is complete, check the Keypair List page to confirm that the resource has been removed.

2.3 - Use Multi-instance GPU on GPU Server

After creating a GPU Server, you can enable the MIG (Multi-instance GPU) feature on the GPU Server’s VM (Guest OS) and create an instance for use.

NVIDIA Multi-instance GPU Introduction

Starting with the NVIDIA Ampere architecture, NVIDIA Multi-instance GPU (hereafter referred to as MIG) lets a single GPU be securely partitioned into separate GPU instances for running CUDA applications. Multiple users can each use a different GPU instance, which improves overall GPU utilization. This feature is especially useful for workloads that do not fully use the GPU’s computing capacity, because users can run multiple workloads in parallel.

Using Multi-instance GPU feature

To use the MIG feature, create an NVIDIA GPU Server on Samsung Cloud Platform, enable MIG on it, and disable MIG when it is no longer needed. MIG is applied and released in the following order.

MIG application order
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG
MIG release order
Delete Compute Instance → Delete GPU Instance → Disable MIG feature (deactivate)
Reference
  • MIG can be used on Samsung Cloud Platform’s next-generation GPU Server or MNGC (Multi-node GPU Cluster).
  • For system requirements to use MIG, refer to the NVIDIA Multi-Instance GPU User Guide.

Applying and Using MIG

To apply MIG, enable it, create the instances, and then assign jobs to them, in the following order.

MIG application order
Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG
Note
The example of applying MIG is explained based on an A100 GPU server.

Activate MIG

  1. Check the GPU status on the VM Instance (GuestOS) before applying MIG.

    • Check whether MIG mode is Disabled.
      $ nvidia-smi
      Mon Sep 27 08:37:08 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
      | N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |             Disabled |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU disabled status (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU disabled status (2)
  2. Enable MIG for each GPU on the VM Instance (GuestOS), and then reboot the VM Instance.

    $ nvidia-smi -i 0 -mig 1
    Enabled MIG mode for GPU 00000000:05:00.0
    All done.

    # reboot
    Code block. nvidia-smi command - enable MIG

Reference

If the GPU is in use when you configure MIG, the following warning message may appear. In that case, check whether any programs are running on the GPU.

Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).
  3. Check the GPU status after applying MIG on the VM Instance (GuestOS).
    • Check whether MIG mode is Enabled.
      $ nvidia-smi
      Mon Sep 27 09:44:33 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
      |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                  |                      |        ECC|                       |
      |=============================================================================|
      |  No MIG devices found                                                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU activation status (1)
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
      Code block. nvidia-smi command - Check GPU activation status (2)
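The "In use by another client" warning described in the Reference above means a process still holds the GPU. One way to look for such processes before enabling MIG is nvidia-smi's CSV query mode. The sketch below is an assumption-labelled helper, not part of the platform: the function names are hypothetical, and the query lists compute (CUDA) processes only, so a monitoring client such as another nvidia-smi instance would not appear in it.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text: str) -> list[tuple[int, str]]:
    """Parse 'pid, process_name' rows as printed with --format=csv,noheader."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [(int(row[0]), row[1]) for row in rows if row]


def gpu_compute_apps(gpu_id: int = 0) -> list[tuple[int, str]]:
    """Ask nvidia-smi which compute processes currently use the given GPU."""
    result = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_id),
         "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

An empty list from `gpu_compute_apps(0)` suggests no CUDA application is holding GPU 0, so enabling MIG is less likely to end in the pending state.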

GPU Instance creation

If you have enabled MIG and verified its status, you can create a GPU Instance.

  1. Check the list of MIG GPU Instance profiles that can be created.
    $ nvidia-smi mig -i [GPU ID] -lgip
    Code block. nvidia-smi command - view MIG GPU Instance profile list
    $ nvidia-smi mig -i 0 -lgip
    +-----------------------------------------------------------------------------+
    | GPU instance profiles:                                                      |
    | GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
    |                              Free/Total   GiB              CE    JPEG  OFA  |
    |=============================================================================|
    |   0  MIG 1g.10gb       19     7/7        9.50       No     14     0     0   |
    |                                                             1     0     0   |
    +-----------------------------------------------------------------------------+
    |   0  MIG 1g.10gb+me    20     1/1        9.50       No     14     0     0   |
    |                                                             1     1     1   |
    +-----------------------------------------------------------------------------+
    |   0  MIG 2g.20gb       14     3/3        19.50      No     28     1     0   |
    |                                                             2     0     0   |
    +-----------------------------------------------------------------------------+
    |   0  MIG 3g.40gb        9     2/2        39.50      No     42     2     0   |
    |                                                             3     0     0   |
    +-----------------------------------------------------------------------------+
    |   0  MIG 4g.40gb        5     1/1        39.50      No     56     2     0   |
    |                                                             4     0     0   |
    +-----------------------------------------------------------------------------+
    |   0  MIG 7g.80gb        0     1/1        79.25      No     98     0     0   |
    |                                                             7     1     1   |
    +-----------------------------------------------------------------------------+
    Code block. MIG GPU Instance profile list
    Reference
    Refer to the NVIDIA Multi-Instance GPU User Guide for GPU Instance profiles.
  2. Create a MIG GPU Instance and check it.
    • Create GPU Instance

      $ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]
      Code block. nvidia-smi command - GPU Instance creation
      $ nvidia-smi mig -i 0 -cgi 0
      Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)
      Code block. nvidia-smi command - Example of creating a GPU Instance

    • Check GPU Instance

      $ nvidia-smi mig -i [GPU ID] -lgi
      Code block. nvidia-smi command - check GPU Instance
      $ nvidia-smi mig -i 0 -lgi
      +--------------------------------------------------------+
      | GPU instances:                                         |
      | GPU   Name               Profile  Instance  Placement  |
      |                            ID        ID     Start:Size |
      |========================================================|
      |   0  MIG 7g.80gb            0         0        0:8     |
      +--------------------------------------------------------+
      Code block. nvidia-smi command - example of checking GPU Instance
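The GPU Instance profile names listed by `-lgip` encode the size of the partition: in `1g.10gb`, `1g` is the number of compute slices and `10gb` is the memory size, and a `+me` suffix marks a profile with media extensions. The following sketch parses that naming pattern; the class and function names are hypothetical, and the pattern is inferred from the profile list shown in this guide.

```python
import re
from typing import NamedTuple


class GpuInstanceProfile(NamedTuple):
    compute_slices: int
    memory_gb: int
    media_extensions: bool


_PROFILE_RE = re.compile(r"^(?:MIG\s+)?(\d+)g\.(\d+)gb(\+me)?$")


def parse_gi_profile(name: str) -> GpuInstanceProfile:
    """Split a name such as 'MIG 3g.40gb' into compute slices, memory size, and the +me flag."""
    match = _PROFILE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized GPU Instance profile name: {name!r}")
    return GpuInstanceProfile(int(match.group(1)), int(match.group(2)), match.group(3) is not None)
```

For example, `parse_gi_profile("MIG 3g.40gb").compute_slices` gives the slice count to compare against the SM column of the profile list.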

Compute Instance creation

If you have created a GPU Instance, you can create a Compute Instance.

  1. Check the MIG Compute Instance profiles you can create.

    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
    Code block. nvidia-smi command - Check MIG Compute Instance profile
    $ nvidia-smi mig -i 0 -gi 0 -lcip
    +---------------------------------------------------------------------------------+
    | Compute instance profiles:                                                      |
    | GPU     GPU     Name            Profile  Instances   Exclusive      Shared      |
    |       Instance                     ID    Free/Total     SM       DEC  ENC  OFA  |
    |          ID                                                      CE   JPEG      |
    |=================================================================================|
    |   0      0      MIG 1c.7g.80gb     0       7/7          14        5    0    1   |
    |                                                                   7    1        |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 2c.7g.80gb     1       3/3          28        5    0    1   |
    |                                                                   7    1        |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 3c.7g.80gb     2       2/2          42        5    0    1   |
    |                                                                   7    1        |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 4c.7g.80gb     3       1/1          56        5    0    1   |
    |                                                                   7    1        |
    +---------------------------------------------------------------------------------+
    |   0      0      MIG 7g.80gb        4*      1/1          98        5    0    1   |
    |                                                                   7    1        |
    +---------------------------------------------------------------------------------+
    Code block. Example of MIG Compute Instance profile list

  2. Create and verify a MIG Compute Instance.

    • MIG Compute Instance creation
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]
      Code block. nvidia-smi command - Create MIG Compute Instance
      $ nvidia-smi mig -i 0 -gi 0 -cci 4
      Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb(ID 4)
      Code block. nvidia-smi command - Example of creating a MIG Compute Instance
    • Check MIG Compute Instance
      $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci
      Code block. nvidia-smi command - check MIG Compute Instance
      $ nvidia-smi mig -i 0 -gi 0 -lci
      +-----------------------------------------------------------------+
      | Compute instance profiles:                                      |
      | GPU     GPU     Name            Profile  Instances   Placement  |
      |       Instance                     ID        ID      Start:Size |
      |          ID                                                     |
      |=================================================================|
      |   0      0      MIG 7g.80gb        4         0          0:7     |
      +-----------------------------------------------------------------+
      Code block. Example of checking MIG Compute Instance
      $ nvidia-smi -L
      GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
        MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
      Code block. nvidia-smi command - Check GPU status (1)
      $ nvidia-smi
      Mon Sep 27 09:52:17 2021
      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |                               |                      |               MIG M. |
      |===============================+======================+======================|
      |   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
      | N/A   32C    P0    49W / 400W |      0MiB / 81251MiB |     N/A      Default |
      |                               |                      |              Enabled |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
      |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                  |                      |        ECC|                       |
      |=============================================================================|
      |  0    0   0   0  |      0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                  |      1MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
      Code block. nvidia-smi command - Check GPU status (2)
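Once a Compute Instance exists, `nvidia-smi -L` lists the MIG device together with its UUID. A CUDA process can be pinned to that device by setting `CUDA_VISIBLE_DEVICES` to the MIG UUID, which is assumed here to be supported by the installed driver (the `MIG-<UUID>` form is the one printed in the listing above). The sketch below parses the listing and prepares such an environment; the function names are hypothetical.

```python
import os
import re

# Matches lines such as: "  MIG 7g.80gb     Device  0: (UUID: MIG-53e2...)"
_MIG_LINE_RE = re.compile(r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(MIG-[0-9a-fA-F-]+)\)")


def list_mig_devices(listing: str) -> list[tuple[str, int, str]]:
    """Return (profile name, device index, UUID) for each MIG line of `nvidia-smi -L` output."""
    devices = []
    for line in listing.splitlines():
        match = _MIG_LINE_RE.match(line)
        if match:
            devices.append((match.group(1), int(match.group(2)), match.group(3)))
    return devices


def env_for_mig(mig_uuid: str) -> dict:
    """Copy the current environment and restrict CUDA to a single MIG device."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the job, so only that MIG device is visible to it.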

Using MIG

  1. Run a job on the MIG Instance.
    • Example of job execution
      $ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi
      Code block. Task execution example
    • The following is an example of a job being run.
      $ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh

      ================
      == TensorFlow ==
      ================

      NVIDIA Release 21.08-tf1 (build 26012104)
      TensorFlow Version 1.15.5

      Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
      ...

      # Run Python process
      root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100
      ...
      PY 3.8.10 (default, Jun 2 2021, 10:49:15)
      [GCC 9.4.0]
      TF 1.15.5
      ...
      Code block. Operation result
  2. Check the GPU usage while the job process is running.
    • When the job runs, you can see that a process is allocated to the MIG device and its memory usage increases.
      $ nvidia-smi
      code block. nvidia-smi command - check GPU utilization
    • You can check the GPU usage as shown below.
      +-----------------------------------------------------------------------------+
      | MIG devices:                                                                |
      +-----------------------------------------------------------------------------+
      | GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
      |      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
      |                  |                      |        ECC|                       |
      |=============================================================================|
      |  0    0   0   0  |  66562MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
      |                  |      5MiB / 13107... |           |                       |
      +-----------------------------------------------------------------------------+
      +-----------------------------------------------------------------------------+
      | Processes:                                                                  |
      |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
      |        ID   ID                                                   Usage      |
      |=============================================================================|
      |    0    0    0      17483      C   python                          66559MiB |
      +-----------------------------------------------------------------------------+
      Code block. Example of checking GPU utilization.
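The `docker run` invocation from step 1 can also be assembled programmatically, which avoids quoting mistakes around the `--gpus` value. The sketch below builds the argument list for a given GPU ID and MIG ID; the function name is hypothetical, and the image name is simply the one used in the example above.

```python
import shlex


def docker_mig_command(gpu_id: int, mig_id: int, image: str, *args: str) -> list[str]:
    """Build a `docker run` argument list that exposes one MIG device to the container."""
    device = f'"device={gpu_id}:{mig_id}"'  # inner double quotes are part of the value, as in the guide
    return ["docker", "run", "--gpus", device, "--rm", image, *args]


# Render the argument list as a copy-pasteable shell command.
command = docker_mig_command(0, 0, "nvcr.io/nvidia/cuda", "nvidia-smi")
print(shlex.join(command))
```

Passing the list directly to `subprocess.run` skips the shell entirely, so the single quotes seen on the command line are only needed when the command is typed interactively.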

Delete and release MIG Instance

Follow these steps to delete the MIG Instances and disable MIG.

MIG release order
Delete Compute Instance → Delete GPU Instance → Disable MIG feature (deactivate)

Delete Compute Instance

  • Delete the Compute Instance.
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance ID] -dci
    Code block. nvidia-smi command - Delete Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -lci
    +-----------------------------------------------------------------+
    | Compute instance profiles: |
    | GPU     GPU     Name            Profile  Instances   Placement |
    | GPU   Instance                     ID      ID        Start:Size |
    | ID |
    | ================================================================= |
    | 0      0      MIG 7g.80gb         4       0            0:7 |
    +-----------------------------------------------------------------+
    Code block. Example of checking MIG Compute Instance
    $ nvidia-smi mig -i 0 -gi 0 -dci
    Successfully destroyed compute instance ID  0 from GPU instance ID  0
    Code block. Compute Instance deletion example
    $ nvidia-smi mig -i 0 -gi 0 -lci
    No compute instances found: Not found
    Code block. Confirm Compute Instance deletion

Delete GPU Instance

  • Delete the GPU Instance.
    $ nvidia-smi mig -i [GPU ID] -dgi
    $ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi
    Code block. nvidia-smi command - Delete GPU Instance
    $ nvidia-smi mig -i 0 -lgi
    +--------------------------------------------------------+
    | GPU instances: |
    | GPU   Name               Profile  Instance  Placement |
    | ID       ID      Start:Size |
    | ======================================================== |
    | 0  MIG 7g.80gb            0        0         0:8 |
    +--------------------------------------------------------+
    Code block. nvidia-smi command - Example of checking GPU Instance
    $ nvidia-smi mig -i 0 -dgi
    Successfully destroyed GPU instance ID  0 from GPU  0
    Code block. nvidia-smi command - GPU Instance deletion example
    $ nvidia-smi mig -i 0 -lgi
    No GPU instances found: Not found
    Code block. nvidia-smi command - Confirm GPU Instance deletion

Disable MIG feature (deactivation)

  • Disable MIG, then reboot the server.
    $ nvidia-smi -mig 0
    Disabled MIG Mode for GPU 00000000:05:00.0

    All done.
    Code block. nvidia-smi command - disable MIG
    $ nvidia-smi
    Mon Sep 30 05:18:28 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
    | -------------------------------+----------------------+---------------------- |
    | GPU  Name        Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap | Memory-Usage | GPU-Util  Compute M. |
    |  |  | MIG M. |
    | ===============================+======================+====================== |
    | 0  NVIDIA A100-SXM...  Off | 00000000:05:00.0 Off | 0 |
    | N/A   33C   P0    60W / 400W | 0MiB / 81251MiB | 0%      Default |
    |  |  | Disabled |
    +-------------------------------+----------------------+----------------------+
    +-----------------------------------------------------------------------------+
    | MIG devices: |
    +-----------------------------------------------------------------------------+
    | GPU  GI  CI  MIG | Memory-Usage | Vol | Shared |
    | ID  ID  Dev | BAR1-Usage | SM     Unc | CE  ENC  DEC  OFA  JPG |
    |  |  | ECC |  |
    | ============================================================================= |
    | No MIG devices found |
    +-----------------------------------------------------------------------------+
    +-----------------------------------------------------------------------------+
    | Processes: |
    | GPU   GI   CI       PID   Type   Process name                   GPU Memory |
    | ID   ID                                                   Usage |
    | ============================================================================= |
    | No running processes found |
    +-----------------------------------------------------------------------------+
    Code block. nvidia-smi command - check GPU status

2.4 - Use NVSwitch on GPU Server

After creating a GPU Server, you can enable the NVSwitch feature on the GPU Server’s VM (Guest OS) and use fast GPU-to-GPU P2P communication.

Caution
Only the GPU Server (8 GPU) and Multi-node GPU Cluster of Samsung Cloud Platform are connected with NVSwitch and NVLink.

Exploring NVIDIA NVSwitch for Multi GPU

NVLink is a bidirectional, direct GPU-to-GPU interconnect that expands I/O between multiple GPUs within a server. NVSwitch joins these links so that every GPU in the server can communicate with every other GPU at full NVLink bandwidth.

Checking NVSwitch operation

Check the NVIDIA Fabric Manager, NVIDIA NVLink topology, and NVIDIA NVLink Status on the GPU server.

Reference
The example for checking NVSwitch operation is explained using the A100 GPU Server (g1v128a8) as a reference.

NVIDIA Fabric Manager operating status

Verify that the service status is active (running), which indicates normal operation.

~$ systemctl status nvidia-fabricmanager
nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2026-02-02 16:23:27 KST; 32min ago
   Main PID: 2191 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 18.0M
        CPU: 33.461s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2191 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Code block. Check NVIDIA Fabric Manager status
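
The same check can be scripted. The following is a minimal sketch under the assumption that the status text has the format shown above; the function name fabricmanager_running is hypothetical.

```shell
# Reads `systemctl status nvidia-fabricmanager` output on stdin and
# succeeds only if the service is reported as active (running).
fabricmanager_running() {
  grep -q 'Active: active (running)'
}
```

A possible use is `systemctl status nvidia-fabricmanager | fabricmanager_running && echo "Fabric Manager OK"`. On systemd hosts, `systemctl is-active nvidia-fabricmanager` gives a similar answer without parsing text.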

Check NVIDIA NVLink topology

Check the NVIDIA NVLink topology.

~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Code block. Check NVIDIA NVLink topology
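
To check the matrix without reading it by eye, a sketch like the following can be used. It assumes the matrix layout shown above (a header row of GPU columns followed by one row per GPU); the function name topo_all_nvlink is hypothetical.

```shell
# Reads `nvidia-smi topo -m` output on stdin and succeeds only if every
# GPU-to-GPU cell is either X (self) or NV# (bonded NVLinks).
topo_all_nvlink() {
  awk '
    $1 ~ /^GPU[0-9]+$/ && !seen_header {
      # Header row: count how many GPU columns the matrix has.
      for (i = 1; i <= NF; i++) if ($i ~ /^GPU[0-9]+$/) ngpu++
      seen_header = 1
      next
    }
    $1 ~ /^GPU[0-9]+$/ {
      # Data row: fields 2..ngpu+1 hold the GPU-to-GPU connection types.
      rows++
      for (i = 2; i <= ngpu + 1; i++)
        if ($i != "X" && $i !~ /^NV[0-9]+$/) bad++
    }
    END { exit !(seen_header && rows == ngpu && bad == 0) }
  '
}
```

For example, `nvidia-smi topo -m | topo_all_nvlink || echo "some GPU pairs are not connected through NVLink"`. A cell such as SYS or PHB makes the check fail, because it means that pair communicates over PCIe.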

Check NVIDIA NVLink Status

Check the NVIDIA NVLink Status.

~$ nvidia-smi nvlink -s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
Code block. Check NVIDIA NVLink status
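
The per-link listing can be condensed into one line per GPU. The sketch below assumes the output format shown above; the function name nvlink_summary is hypothetical. With the example output, each GPU reports 12 links at 25 GB/s, which sums to 300 GB/s.

```shell
# Reads `nvidia-smi nvlink -s` output on stdin and prints, per GPU,
# the number of links that report a speed and the sum of those speeds.
nvlink_summary() {
  awk '
    /^GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu); next }
    /Link [0-9]+:/ && /GB\/s/ {
      links[gpu]++
      total[gpu] += $(NF - 1)   # the number just before "GB/s"
    }
    END {
      for (g in links)
        printf "GPU %s: %d links, %g GB/s\n", g, links[g], total[g]
    }
  ' | sort
}
```

For example, `nvidia-smi nvlink -s | nvlink_summary`. Links that do not report a GB/s value are not counted, so a GPU with fewer links than the others stands out.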

2.5 - Install ServiceWatch Agent

Users can install the ServiceWatch Agent on a GPU server to collect custom metrics and logs.

Reference
Collecting custom metrics/logs via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will also be available in other offerings in the future.
Caution
Metric collection through the ServiceWatch Agent is classified as custom metrics and, unlike the default metrics collected from each service, incurs charges; therefore, we recommend removing or disabling unnecessary metric collection settings.

ServiceWatch Agent

Two main types of agents must be installed on a GPU server to collect custom metrics and logs for ServiceWatch: the Prometheus Exporter and the Open Telemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape.
  • To collect OS metrics from servers, you can use Node Exporter for Linux servers and Windows Exporter for Windows servers, depending on the OS type.
Open Telemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch).
  • Exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Explanation of Prometheus Exporter and Open Telemetry Collector
Caution

If you have configured a Kubernetes Engine on a GPU server, please check GPU metrics using the metrics provided by the Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not operate correctly.

Pre-configuration for Using ServiceWatch Agent

To utilize the ServiceWatch Agent, please refer to Prerequisite Settings for ServiceWatch Agent and prepare the prerequisite settings.

Install Prometheus Exporter for GPU metrics (for Ubuntu)

Install the Prometheus Exporter for collecting metrics from the GPU server according to the steps below.

Verify NVIDIA Driver Installation

  • Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example

NVSwitch Configuration and Query (NSCQ) Library Installation

Reference
The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.
Information
The installation commands below can be used in an environment with internet access. In an environment without internet access, you must download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    Code block. cuda-keyring download command
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    Code block. cuda-keyring installation command
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Example of checking NVIDIA Driver version

  2. Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535
    Code block. NSCQ library apt-cache command
    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install the version that matches the driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NSCQ library apt-cache command result
    apt install libnvidia-nscq-535=535.183.06-1
    Code block. NSCQ library installation command

Information

libnvidia-nscq must be installed at the same version as the NVIDIA Driver.

  • Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
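
The version matching rule can be expressed as a small helper that builds the apt package specification from the driver version. This is a sketch only: the function name nscq_package_for_driver is hypothetical, and the "-1" package revision suffix is an assumption taken from the example above, so confirm the exact version string with apt-cache policy before installing.

```shell
# Builds "<package>=<version>" for the NSCQ library that matches a driver.
# Assumption: the package revision suffix is "-1", as in the example above.
nscq_package_for_driver() {
  driver_version="$1"
  branch="${driver_version%%.*}"   # 535.183.06 -> 535
  echo "libnvidia-nscq-${branch}=${driver_version}-1"
}
```

For example, `nscq_package_for_driver 535.183.06` prints `libnvidia-nscq-535=535.183.06-1`, which can be passed to apt install.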

NVSwitch Device Monitoring API(NVSDM) Library Installation

Reference
For Blackwell or later GPU architectures, installing the NVSDM Library is required. NVIDIA Driver versions below 560 do not provide the NVSDM Library.
  • Install the NVSDM library.
    apt-cache policy libnvsdm
    Code block. NVSDM library apt-cache command
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library apt-cache command result
    apt install libnvsdm=580.105.08-1
    Code block. Install NVSDM library

Install NVIDIA DCGM (for Ubuntu)

Install the DCGM Exporter according to the steps below.

  1. DCGM(datacenter-gpu-manager) Installation
  2. datacenter-gpu-manager-exporter installation
  3. DCGM Service Activation and Start

DCGM(datacenter-gpu-manager) Installation

datacenter-gpu-manager-4-cuda12 is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a package for managing and monitoring NVIDIA data center GPUs. The cuda12 suffix indicates that the package is built for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  2. Install datacenter-gpu-manager-cuda.
    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation command
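
Instead of reading the CUDA major version by eye, it can be extracted from the nvidia-smi header line. The following is a minimal sketch that assumes the "CUDA Version: X.Y" format shown above; the function name cuda_major_version is hypothetical.

```shell
# Reads nvidia-smi output on stdin and prints the CUDA major version
# taken from the first "CUDA Version: X.Y" occurrence.
cuda_major_version() {
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}
```

For example, `CUDA_VERSION="$(nvidia-smi | cuda_major_version)"` sets the variable used by the installation command above.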

Install datacenter-gpu-manager-exporter

datacenter-gpu-manager-exporter is a tool that, based on NVIDIA Data Center GPU Manager (DCGM), collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them for use in monitoring systems like Prometheus.

  1. Install datacenter-gpu-manager-exporter.
    apt install datacenter-gpu-manager-exporter
    Code block. datacenter-gpu-manager-exporter installation command
  2. Check the DCGM Exporter configuration file.
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. Command to check the datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Example of verification result for the datacenter-gpu-manager-exporter configuration file
  3. Review the default configuration provided with the DCGM Exporter: remove the leading # from the metrics you need, and add # to the metrics you do not need.
    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. Example of datacenter-gpu-manager-exporter metric configuration
Reference
See DCGM Exporter Metrics for the metrics that can be collected with the GPU DCGM Exporter and how to configure them.
Caution
Since metric collection through the ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default collected metrics, unnecessary metric collection must be removed or disabled to avoid excessive charges.
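
Commenting metrics in and out of the counters file can also be scripted instead of edited by hand. The sketch below assumes each metric line starts with the field name followed by a comma, as in the example above; the function names disable_metric, enable_metric, and _rewrite_counters are hypothetical.

```shell
# _rewrite_counters FILE SED_EXPRESSION: applies the expression to FILE
# through a temporary file, keeping the original file's permissions.
_rewrite_counters() {
  tmp="$(mktemp)" || return 1
  sed "$2" "$1" > "$tmp" && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# disable_metric FILE FIELD: comments out the line for FIELD.
disable_metric() { _rewrite_counters "$1" "s/^$2,/# $2,/"; }

# enable_metric FILE FIELD: removes the leading "#" from the line for FIELD.
enable_metric() { _rewrite_counters "$1" "s/^#[[:space:]]*$2,/$2,/"; }
```

For example, `disable_metric /etc/dcgm-exporter/default-counters.csv DCGM_FI_PROF_DRAM_ACTIVE` stops that metric from being exported. If the exporter service is already running, it most likely needs a restart to pick up the change.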

Enable and start DCGM service

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command

Information
After completing the DCGM Exporter setup, you must install the Open Telemetry Collector provided by ServiceWatch to finish configuring the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.
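
Before moving on to the Open Telemetry Collector, it can help to confirm that the exporter is actually serving metrics. The sketch below counts DCGM samples in Prometheus text output; the function name count_dcgm_metrics is hypothetical, and the endpoint in the usage note (port 9400, path /metrics) is an assumption based on the exporter's usual defaults, so adjust it if your configuration differs.

```shell
# Reads Prometheus text-format output on stdin and prints the number of
# sample lines whose metric name starts with DCGM_ ("# HELP" and
# "# TYPE" comment lines are not counted).
count_dcgm_metrics() {
  grep -c '^DCGM_' || true
}
```

For example, `curl -s http://localhost:9400/metrics | count_dcgm_metrics`. A result of 0 suggests that the exporter is not running or that every metric is commented out in the counters file.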

Install Prometheus Exporter for GPU metrics (for RHEL)

Install the ServiceWatch Agent to collect metrics from the GPU server according to the steps below.

NVIDIA Driver installation verification (for RHEL)

  1. Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Example of checking NVIDIA Driver version

NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)

Reference

The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.

  • For RHEL, verify that libnvidia-nscq is installed and install it if necessary.
Information
The installation commands below can be executed in an environment with internet access. In an environment without internet access, you must download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Check the libnvidia-nscq package.

    Color mode
    rpm -qa | grep libnvidia-nscq libnvidia-nscq-535-535.183.06-1.x86_64
    rpm -qa | grep libnvidia-nscq libnvidia-nscq-535-535.183.06-1.x86_64
    Code block. Check NSCQ library package

  2. Add the CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  3. Reset the state of the NVIDIA Driver DNF module.

    dnf module reset nvidia-driver
    Code block. Reset the state of the NVIDIA Driver DNF module
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Disabling module profiles:
    nvidia-driver/default
    nvidia-driver/fm
    Resetting modules:
    nvidia-driver
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example result of resetting the NVIDIA Driver DNF module state

  4. Enable the NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open
    Code block. NVIDIA Driver module activation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Enabling module streams:
    nvidia-driver 535-open
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example of NVIDIA Driver module activation result

  5. Check the list of available libnvidia-nscq packages.

    dnf list libnvidia-nscq-535 --showduplicates
    Code block. Check libnvidia-nscq package list

  6. Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
    Code block. libnvidia-nscq installation command
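
In the example above, the installed package carries the same version as the driver (535.183.06), and its name includes the driver branch (535). The sketch below builds that package spec from a driver version string. It assumes the naming pattern and the "-1" release suffix seen in the example; nscq_package_spec is a hypothetical helper, so confirm the result against the package list from step 5 before installing.

```shell
# nscq_package_spec: build the package spec used in step 6 from a driver version.
# Hypothetical helper; the "-1" release suffix is assumed from the example above.
nscq_package_spec() {
  version="$1"
  branch="${version%%.*}"   # 535.183.06 -> 535
  printf 'libnvidia-nscq-%s-%s-1\n' "$branch" "$version"
}

nscq_package_spec "535.183.06"   # prints libnvidia-nscq-535-535.183.06-1

# Possible use on a GPU server, after checking the package list:
#   dnf install "$(nscq_package_spec 535.183.06)"
```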

NVSwitch Device Monitoring API (NVSDM) Library Installation (for RHEL)

Reference
For Blackwell or later GPU architectures, the NVSDM Library is required. NVIDIA Driver versions below 560 do not provide the NVSDM Library.
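
The driver version condition in the note above can be checked before attempting the installation. The following is a minimal sketch under that single assumption (driver branch 560 or newer); supports_nvsdm is a hypothetical helper name, and the two versions in the loop are the ones that appear in the examples in this guide.

```shell
# supports_nvsdm: succeed (exit status 0) when the driver branch is 560 or newer.
# Hypothetical helper based on the note above.
supports_nvsdm() {
  branch="${1%%.*}"          # 580.105.08 -> 580
  [ "$branch" -ge 560 ]
}

for v in 535.183.06 580.105.08; do
  if supports_nvsdm "$v"; then
    echo "$v: NVSDM Library is provided"
  else
    echo "$v: NVSDM Library is not provided"
  fi
done
```
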
  1. Check the list of available NVSDM library packages.

    dnf list libnvsdm --showduplicates
    Code block. Check NVSDM library package list
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. Example of NVSDM library package list verification results

  2. Install libnvsdm.

    dnf install libnvsdm-580.105.08-1
    Code block. Install NVSDM library
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    =========================================================================
    Package Architecture Version Repository Size
    =========================================================================
    Installing:
    libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
    Installing dependencies:
    infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
    libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
    
    Transaction Summary
    =========================================================================
    Install 3 Packages
    
    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
    Code block. Example of the NVSDM library installation command result

Install NVIDIA DCGM (for RHEL)

Install NVIDIA DCGM and the DCGM Exporter according to the steps below.

  1. DCGM (datacenter-gpu-manager) installation
  2. datacenter-gpu-manager-exporter installation
  3. DCGM service activation and start

DCGM (datacenter-gpu-manager) Installation (for RHEL)

datacenter-gpu-manager-4-cuda12 is a specific build of NVIDIA's Data Center GPU Manager (DCGM), a package for managing and monitoring NVIDIA data center GPUs. The cuda12 suffix indicates that the package is built for CUDA 12, and datacenter-gpu-manager-4 refers to DCGM version 4.x. DCGM provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Add the CUDA repository to DNF.
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository
  2. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  3. Check the list of available datacenter-gpu-manager-cuda packages.
    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
    Code block. Check the datacenter-gpu-manager-cuda package list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
    Code block. Example of checking the list of datacenter-gpu-manager-cuda packages
  4. Install datacenter-gpu-manager-cuda.
    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. Install datacenter-gpu-manager-cuda
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M
    
    Transaction Summary
    ====================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
    Code block. Example of datacenter-gpu-manager-cuda installation result
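
In step 2 the CUDA major version is read from the nvidia-smi output and typed in by hand. As a minimal sketch, the value could also be derived in a script as shown below. The function name cuda_major is a hypothetical helper, and the sample line is the example output from step 2.

```shell
# cuda_major: read nvidia-smi output on stdin and print the CUDA major version.
# Hypothetical helper for illustration only.
cuda_major() {
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Sample line taken from the example output in step 2.
line='| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |'
CUDA_VERSION="$(printf '%s\n' "$line" | cuda_major)"
echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"   # prints datacenter-gpu-manager-4-cuda12

# On a GPU server, feed the real output instead of the sample:
#   CUDA_VERSION="$(nvidia-smi | cuda_major)"
```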

datacenter-gpu-manager-exporter installation (for RHEL)

datacenter-gpu-manager-exporter (DCGM Exporter) is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them for use in monitoring systems such as Prometheus.

  1. Add the CUDA Repository to DNF. (If you have already performed this step, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  2. Check the CUDA version. (If you have already performed this step, proceed to the next step.)

    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2 |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command

  3. Check the list of available datacenter-gpu-manager-exporter packages.

    dnf list datacenter-gpu-manager-exporter --showduplicates
    Code block. Check the list of datacenter-gpu-manager-exporter packages
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64
    Code block. Example of checking the package list of datacenter-gpu-manager-exporter

  4. Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.x requires glibc 2.34 or newer, which is provided only from RHEL 9 onward, so on RHEL 8 install version 4.1.3-1 explicitly.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1
    Code block. Install datacenter-gpu-manager-exporter
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter               x86_64         4.1.3-1     cuda-rhel8-x86_64   26 M
    
    
    Is this ok [y/N]: y
    Code block. Example of datacenter-gpu-manager-exporter installation result
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. Check the datacenter-gpu-manager-exporter service file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Example of the result of checking the datacenter-gpu-manager-exporter service file

  5. Review the configuration file provided with the DCGM Exporter installation. Remove the leading # from metrics you need, and add # in front of metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. Example of datacenter-gpu-manager-exporter metric configuration
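
The edit in step 5 can also be scripted instead of using an editor. The following is a minimal sketch that comments or uncomments a single metric line by its DCGM field name. The helper names apply_sed, disable_metric, and enable_metric are hypothetical, and the demonstration runs against a temporary copy of two lines from the example above rather than the real configuration file.

```shell
# apply_sed FILE EXPR: apply a sed expression to FILE through a temporary copy.
apply_sed() {
  tmp="$(mktemp)" || return 1
  sed "$2" "$1" > "$tmp" && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# disable_metric FILE FIELD: put "# " in front of an active metric line.
disable_metric() {
  apply_sed "$1" "s/^[[:space:]]*\($2,\)/# \1/"
}

# enable_metric FILE FIELD: remove the leading "#" from a commented metric line.
enable_metric() {
  apply_sed "$1" "s/^[[:space:]]*#[[:space:]]*\($2,\)/\1/"
}

# Demonstration on a temporary file holding two lines from the example above.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
EOF
disable_metric "$demo" DCGM_FI_PROF_DRAM_ACTIVE
enable_metric "$demo" DCGM_FI_PROF_PIPE_FP64_ACTIVE
cat "$demo"
rm -f "$demo"

# On a GPU server the target would be /etc/dcgm-exporter/default-counters.csv.
```

Both helpers leave lines that are already in the requested state unchanged, so they can be run repeatedly. Restarting the nvidia-dcgm-exporter service after a change, so that the file is read again, is an assumption of this sketch.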

Reference
Refer to DCGM Exporter Metrics for the metrics that can be collected with the GPU DCGM Exporter and how to configure them.
Caution
Metrics collected through the ServiceWatch Agent are classified as custom metrics and, unlike the metrics collected by default, incur charges. Remove or disable any metrics you do not need to avoid excessive charges.

Enable and start DCGM service (for RHEL)

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command
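
After both services are started, it can be useful to confirm that the exporter is actually serving metrics. The sketch below counts the DCGM sample lines in Prometheus text output. The function name count_dcgm_series is a hypothetical helper, the sample text is a simplified illustration with made-up values, and the endpoint http://localhost:9400/metrics in the comment assumes the exporter's usual default port, so check the service configuration if it differs.

```shell
# count_dcgm_series: read Prometheus text exposition on stdin and print the
# number of sample lines whose metric name starts with DCGM_.
# Comment lines (# HELP, # TYPE) are not counted.
count_dcgm_series() {
  grep -c '^DCGM_' || true
}

# Simplified sample with illustrative values (real output carries more labels).
sample='# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 43'
printf '%s\n' "$sample" | count_dcgm_series   # prints 2

# On a GPU server (port 9400 is an assumption, see above):
#   systemctl is-active nvidia-dcgm nvidia-dcgm-exporter
#   curl -s http://localhost:9400/metrics | count_dcgm_series
```

A count of 0 suggests that the exporter is not running or that every metric in the configuration file is commented out.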

Information
After completing the DCGM Exporter configuration, install the OpenTelemetry Collector provided by ServiceWatch to finish setting up the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.

DCGM Exporter Metrics

DCGM Exporter Key Metrics

Among the metrics provided by the DCGM Exporter, the main GPU metrics are as follows.

| Category | DCGM Field | Prometheus Metric Type | Summary |
|---|---|---|---|
| Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz) |
| Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C) |
| Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W) |
| Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %) |
| Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %) |
| Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB) |
| Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB) |
| NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes |
Table. Major GPU metrics provided by DCGM Exporter

DCGM Exporter metric collection configuration

For the default metrics of DCGM Exporter, refer to DCGM Exporter > Default Metrics.

  • To collect additional metrics beyond the default configuration, remove the leading # from the corresponding entries in default-counters.csv.
  • For default metrics you do not wish to collect, add # in front of the corresponding entry or delete it.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...
Code block. default-counters.csv configuration example
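
Because each enabled entry is collected as a custom metric, it helps to review which fields are currently active before starting the exporter. The following is a minimal sketch that prints the DCGM field name of every active line in the format described above. The function name enabled_metrics is a hypothetical helper, and the input is an excerpt of the example configuration.

```shell
# enabled_metrics: read a counters CSV on stdin and print the DCGM field name
# of every line that is neither blank nor a comment.
# Hypothetical helper for illustration only.
enabled_metrics() {
  grep -v '^[[:space:]]*#' \
    | grep -v '^[[:space:]]*$' \
    | cut -d, -f1 \
    | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

# Excerpt of the example configuration above.
enabled_metrics <<'EOF'
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
EOF

# On a GPU server:
#   enabled_metrics < /etc/dcgm-exporter/default-counters.csv
```

For the excerpt, the output lists DCGM_FI_DEV_SM_CLOCK and DCGM_FI_DEV_MEM_CLOCK, and the commented PCIe entry is skipped.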

3 - API Reference

API Reference

4 - CLI Reference

CLI Reference

5 - Release Note

GPU Server

2026.03.19
FEATURE Add Kubernetes image and SSD_Provisioned disk type
  • Add Kubernetes image
    • When creating a GPU Server, you can select the Kubernetes image (Ubuntu).
  • An SSD volume with configurable IOPS and throughput has been added.
    • When creating Block Storage, you can select the SSD_Provisioned disk type.
    • You can set the maximum IOPS and Throughput values.
2025.10.23
FEATURE Add new features and provide ServiceWatch service integration functionality
  • Provision of ServiceWatch service integration
    • You can monitor data through the ServiceWatch service.
  • When creating a GPU server, you can select a RHEL image.
  • Key Pair management feature has been added.
    • You can generate a Key Pair for use, or retrieve a public key and apply it.
2025.07.01
FEATURE Add GPU Server feature, change Image sharing method, and add GPU Server usage guide
  • Add GPU Server feature
    • The IP, Public NAT IP, and Private NAT IP configuration features have been added.
    • An LLM Endpoint is provided for LLM usage.
  • The method for sharing images between accounts has changed.
    • You can create a new Image for sharing and share it.
  • Add GPU Server usage guide
2025.04.28
FEATURE Add OS image
  • GPU Server RHEL OS and GPU driver versions have been added.
2025.02.27
FEATURE Common feature change
  • Add GPU Server feature
    • NAT configuration feature has been added to the GPU Server.
  • Samsung Cloud Platform Common Feature Changes
    • Account, IAM, Service Home, tags, and other common CX changes have been applied.
2024.10.01
NEW GPU Server service official version release
  • We have officially launched the GPU Server service.
  • We have launched a virtualization computing service that lets you allocate and use infrastructure resources such as CPU, GPU, and memory provided by the server as needed, without having to purchase them individually.