GPU Server
1 - Overview
Service Overview
GPU Server is a virtualized computing service that lets you allocate infrastructure resources such as CPU, GPU, and memory in the amount you need, when you need them, without purchasing them individually. It suits workloads that require fast computation, such as AI model experimentation, prediction, and inference in a cloud environment, and lets you flexibly select resources with performance matched to the type and scale of the work. GPU Server provides the following features.
Provided Features
- GPU Server Management: Through a web-based console, users can handle GPU Server provisioning (create, modify, delete), monitoring, and billing on a self-service basis.
- Product offerings by GPU quantity: Depending on the project’s purpose and scale, you can freely select the number of H100/A100 GPUs to configure a virtual server.
- High‑Performance GPU Provision: We provide a high‑performance GPU server at physical‑server level using a pass‑through method.
- Storage Connection: Provides additional attached storage besides the OS disk. You can connect and use Block Storage, File Storage, and Object Storage.
- Strong Security Application: Use the Security Group service to control inbound/outbound traffic exchanged with the external internet or other VPCs (Virtual Private Cloud), securely protecting the server.
- Monitoring: You can view monitoring information such as the status of computing resources—including CPU, Memory, Disk, and GPU—through the Cloud Monitoring service.
- Network Configuration Management: The server’s subnet/IP can be easily changed from the values set at initial creation. The NAT IP can be enabled or disabled as needed.
- Key Pair method: To ensure a secure OS access method, we provide a Key Pair method instead of ID/PW login.
- Image Management: You can create and manage Custom Images and share them between projects.
- ServiceWatch Integration: You can monitor data through the ServiceWatch service.
Component
GPU Server provides GPUs, NVSwitch, and NVLink on top of virtualized computing resources.
- NVSwitch can only be enabled and used for instance types that allocate eight GPUs on a single GPU server.
Specifications by GPU Type
A GPU (Graphics Processing Unit) performs the calculations needed to render the images displayed on a computer screen. Because it is specialized for parallel processing, it can also handle large amounts of data quickly, which makes it well suited to large-scale parallel workloads such as artificial intelligence (AI) and data analysis.
The following are the specifications of the GPU Types offered by the GPU Server service.
| Category | A100 Type | H100 Type | B300 Type |
|---|---|---|---|
| GPU Architecture | NVIDIA Ampere | NVIDIA Hopper | NVIDIA Blackwell Ultra |
| GPU Memory | 80 GiB | 80 GiB | 268 GiB |
| GPU Transistors | 54 billion 7N TSMC | 80 billion 4N TSMC | 208 billion 4NP TSMC |
| FP16 Tensor Core (Dense) | 312 TFLOPs | 989 TFLOPs | 2.25 PFLOPs |
| FP8 Tensor Core (Dense) | Not supported | 1,979 TFLOPs | 4.5 PFLOPs |
| FP4 Tensor Core (Dense) | Not supported | Not supported | 13.5 PFLOPs |
| GPU Memory Bandwidth | 2,039 GB/s HBM2e | 3,352 GB/s HBM3 | 8 TB/s HBM3e |
| NVLink performance | NVLink 3 | NVLink 4 | NVLink 5 |
| NVLink Signaling Rate | 25 GB/s (x12) | 25 GB/s (x18) | 50 GB/s (x18) |
| NVSwitch GPU-to-GPU bandwidth | 600 GB/s | 900 GB/s | 1.8 TB/s |
| Total NVSwitch aggregate bandwidth | 4.8 TB/s | 7.2 TB/s | 14.4 TB/s |
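The NVLink and NVSwitch rows in the table above are related by simple arithmetic. The sketch below assumes that the signaling rate is per link and per direction, that the GPU-to-GPU figure counts both directions, and that the aggregate figure covers an 8-GPU server. These assumptions reproduce the table's values but are not stated in this guide.

```python
# Per-link signaling rate (GB/s, one direction) and link count, taken from the table above.
NVLINK = {
    "A100": {"rate_gb_s": 25, "links": 12},
    "H100": {"rate_gb_s": 25, "links": 18},
    "B300": {"rate_gb_s": 50, "links": 18},
}

GPUS_PER_SERVER = 8  # assumption: the aggregate row describes an 8-GPU server


def gpu_to_gpu_bandwidth(gpu_type):
    """Per-GPU bandwidth in GB/s, assuming rate x links x 2 directions."""
    spec = NVLINK[gpu_type]
    return spec["rate_gb_s"] * spec["links"] * 2


def aggregate_bandwidth(gpu_type):
    """Total bandwidth in GB/s across all GPUs of one server."""
    return gpu_to_gpu_bandwidth(gpu_type) * GPUS_PER_SERVER


for gpu_type in NVLINK:
    print(gpu_type, gpu_to_gpu_bandwidth(gpu_type), aggregate_bandwidth(gpu_type))
# A100 600 4800
# H100 900 7200
# B300 1800 14400
```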
Server type
The server types offered by the GPU Server are as follows. For detailed information about the server types provided by GPU Server, refer to GPU Server server type.
| Category | Server type | CPU vCore | Memory(GB) | Number of GPUs |
|---|---|---|---|---|
| GPU-A100-1 | g1v16a1 | 16 | 234 | 1 |
| GPU-A100-1 | g1v32a2 | 32 | 468 | 2 |
| GPU-A100-1 | g1v64a4 | 64 | 936 | 4 |
| GPU-A100-1 | g1v128a8 | 128 | 1,872 | 8 |
| GPU-H100-2 | g2v12h1 | 12 | 234 | 1 |
| GPU-H100-2 | g2v24h2 | 24 | 468 | 2 |
| GPU-H100-2 | g2v48h4 | 48 | 936 | 4 |
| GPU-H100-2 | g2v96h8 | 96 | 1,872 | 8 |
| GPU-B300-3 | g3v16b1 | 16 | 480 | 1 |
| GPU-B300-3 | g3v32b2 | 32 | 960 | 2 |
| GPU-B300-3 | g3v64b4 | 64 | 1,920 | 4 |
| GPU-B300-3 | g3v128b8 | 128 | 3,840 | 8 |
OS and GPU driver version
The operating systems (OS) supported by the GPU Server are as follows. Note that B300 GPUs are supported only from a specific GPU driver version onward, so take care when selecting an image.
| OS | OS version | GPU driver version | Server type classification |
|---|---|---|---|
| Ubuntu | 24.04 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1 |
| Ubuntu | 24.04 | 570.195.03 | GPU-H100-2, GPU-A100-1 |
| Ubuntu | 22.04 | 535.183.06 | GPU-H100-2, GPU-A100-1 |
| RHEL | 9.6 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1 |
| RHEL | 8.10 | 580.126.20 | GPU-B300-3, GPU-H100-2, GPU-A100-1 |
| RHEL | 8.10 | 535.183.06 | GPU-H100-2, GPU-A100-1 |
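When choosing an image, it can help to filter the table above by server type classification. The following is a minimal sketch that copies the table rows into a list and looks up the matching images. The names `IMAGES` and `images_for` are illustrative and not part of any Samsung Cloud Platform API.

```python
# Rows of the OS and GPU driver version table above:
# (OS, OS version, GPU driver version, supported server type classifications)
ALL_TYPES = frozenset({"GPU-B300-3", "GPU-H100-2", "GPU-A100-1"})
NO_B300 = frozenset({"GPU-H100-2", "GPU-A100-1"})

IMAGES = [
    ("Ubuntu", "24.04", "580.126.20", ALL_TYPES),
    ("Ubuntu", "24.04", "570.195.03", NO_B300),
    ("Ubuntu", "22.04", "535.183.06", NO_B300),
    ("RHEL", "9.6", "580.126.20", ALL_TYPES),
    ("RHEL", "8.10", "580.126.20", ALL_TYPES),
    ("RHEL", "8.10", "535.183.06", NO_B300),
]


def images_for(classification):
    """Return (OS, OS version, driver version) rows that list the given classification."""
    return [
        (os_name, os_version, driver)
        for os_name, os_version, driver, supported in IMAGES
        if classification in supported
    ]


for row in images_for("GPU-B300-3"):
    print(row)
# ('Ubuntu', '24.04', '580.126.20')
# ('RHEL', '9.6', '580.126.20')
# ('RHEL', '8.10', '580.126.20')
```

In this table, every image that lists GPU-B300-3 uses driver version 580.126.20.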
Preceding Service
The following services must be configured before you create this service. Prepare them in advance by referring to the user guide provided for each service.
| Service Category | Service | Detailed description |
|---|---|---|
| Networking | VPC | A service that provides an isolated virtual network in a cloud environment |
| Networking | Security Group | Virtual firewall that controls server traffic |
1.1 - Server type
GPU Server server type
GPU servers are categorized by the GPU type they provide, and the GPU used in a GPU server is determined by the server type selected when creating the GPU server. Please select the server type based on the specifications of the application you want to run on the GPU server.
The server types supported by the GPU Server are as follows.
GPU-H100-2 g2v12h1
| Category | Example | Detailed description |
|---|---|---|
| Server type | GPU-H100-2 | Provided server type classification |
| Server specifications | g2 | Provided server type classification and generation |
| Server specifications | v12 | Number of vCores |
| Server specifications | h1 | GPU type and quantity |
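The naming scheme above can be decoded mechanically. The following sketch parses a server type string such as `g2v12h1` into its parts. The regular expression and the letter-to-GPU mapping are inferred from the server types listed in this guide (`a` for A100, `h` for H100, `b` for B300), and `parse_server_type` is a hypothetical helper, not a platform function.

```python
import re
from typing import NamedTuple

# Letter-to-GPU mapping inferred from the server type tables in this guide.
GPU_LETTERS = {"a": "A100", "h": "H100", "b": "B300"}

_PATTERN = re.compile(r"g(\d+)v(\d+)([a-z])(\d+)")


class ServerType(NamedTuple):
    generation: int
    vcores: int
    gpu_type: str
    gpu_count: int


def parse_server_type(name):
    """Split a name like 'g2v12h1' into generation, vCores, GPU type, and GPU count."""
    match = _PATTERN.fullmatch(name)
    if match is None or match.group(3) not in GPU_LETTERS:
        raise ValueError(f"unrecognized server type: {name!r}")
    generation, vcores, letter, count = match.groups()
    return ServerType(int(generation), int(vcores), GPU_LETTERS[letter], int(count))


print(parse_server_type("g2v12h1"))
# ServerType(generation=2, vcores=12, gpu_type='H100', gpu_count=1)
```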
g1 server type
The g1 server type is a GPU Server that uses the NVIDIA A100 Tensor Core GPU, suitable for high-performance applications.
- Up to 8 NVIDIA A100 Tensor Core GPUs provided
- Equipped with 6,912 CUDA cores and 432 Tensor cores per GPU
- Supports up to 128 vCPUs and 1,872 GB of memory
- Networking speed up to 40 Gbps
- 600 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
| Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth |
|---|---|---|---|---|---|---|
| GPU-A100-1 | g1v16a1 | 1 | 16 vCore | 234 GB | 80 GiB | Up to 20 Gbps |
| GPU-A100-1 | g1v32a2 | 2 | 32 vCore | 468 GB | 160 GiB | Up to 20 Gbps |
| GPU-A100-1 | g1v64a4 | 4 | 64 vCore | 936 GB | 320 GiB | Up to 40 Gbps |
| GPU-A100-1 | g1v128a8 | 8 | 128 vCore | 1,872 GB | 640 GiB | Up to 40 Gbps |
g2 server type
The g2 server type is a GPU Server that uses the NVIDIA H100 Tensor Core GPU, making it suitable for high-performance applications.
- Up to 8 NVIDIA H100 Tensor Core GPUs provided
- Equipped with 16,896 CUDA cores and 528 Tensor cores per GPU
- Supports up to 96 vCPUs and 1,872 GB of memory
- Networking speed up to 40 Gbps
- 900 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
| Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth |
|---|---|---|---|---|---|---|
| GPU-H100-2 | g2v12h1 | 1 | 12 vCore | 234 GB | 80 GiB | Up to 20 Gbps |
| GPU-H100-2 | g2v24h2 | 2 | 24 vCore | 468 GB | 160 GiB | Up to 20 Gbps |
| GPU-H100-2 | g2v48h4 | 4 | 48 vCore | 936 GB | 320 GiB | Up to 40 Gbps |
| GPU-H100-2 | g2v96h8 | 8 | 96 vCore | 1,872 GB | 640 GiB | Up to 40 Gbps |
g3 server type
The g3 server type is a GPU Server that uses the NVIDIA B300 Tensor Core GPU, making it suitable for high-performance applications.
- Up to 8 NVIDIA B300 Tensor Core GPUs provided
- Equipped with 20,480 CUDA cores and 640 Tensor cores per GPU
- Supports up to 128 vCPUs and 3,840 GB of memory
- Networking speed up to 40 Gbps
- 1.8 TB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
| Category | Server type | GPU | CPU | Memory | GPU Memory | Network Bandwidth |
|---|---|---|---|---|---|---|
| GPU-B300-3 | g3v16b1 | 1 | 16 vCore | 480 GB | 268 GiB | Up to 20 Gbps |
| GPU-B300-3 | g3v32b2 | 2 | 32 vCore | 960 GB | 536 GiB | Up to 20 Gbps |
| GPU-B300-3 | g3v64b4 | 4 | 64 vCore | 1,920 GB | 1,072 GiB | Up to 40 Gbps |
| GPU-B300-3 | g3v128b8 | 8 | 128 vCore | 3,840 GB | 2,144 GiB | Up to 40 Gbps |
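The GPU Memory column in the server type tables is the per-GPU memory multiplied by the number of GPUs, expressed in GiB (binary units). The sketch below reproduces those totals and converts GiB to decimal GB for comparison. The per-GPU values come from the specification table in the Overview, and the helper names are illustrative.

```python
GIB = 2**30  # bytes in one GiB (binary)
GB = 10**9   # bytes in one GB (decimal)

# Per-GPU memory in GiB, from the GPU specification table in the Overview.
PER_GPU_MEMORY_GIB = {"A100": 80, "H100": 80, "B300": 268}


def total_gpu_memory_gib(gpu_type, gpu_count):
    """Total GPU memory of a server type in GiB."""
    return PER_GPU_MEMORY_GIB[gpu_type] * gpu_count


def gib_to_gb(gib):
    """Convert binary GiB to decimal GB."""
    return gib * GIB / GB


print(total_gpu_memory_gib("H100", 8))  # 640
print(total_gpu_memory_gib("B300", 8))  # 2144
print(round(gib_to_gb(268), 1))         # 287.8
```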
1.2 - Monitoring Metrics
According to Samsung Cloud Platform’s policy, the Cloud Monitoring service is scheduled to be discontinued in September 2026.
Accordingly, after the September 2026 release, resource monitoring of the Samsung Cloud Platform via Cloud Monitoring will no longer be possible.
As the replacement, ServiceWatch, released in October 2025, lets you continue monitoring your resources without interruption.
ServiceWatch provides more modern and powerful features, replacing Cloud Monitoring to deliver a seamless monitoring environment.
If you are collecting metrics and logs through the Cloud Monitoring Agent, you need to switch to the ServiceWatch Agent.
For detailed information about ServiceWatch, please refer to ServiceWatch Overview.
For detailed information about the ServiceWatch Agent, please refer to ServiceWatch Agent.
GPU Server Monitoring Metrics
The table below shows the monitoring metrics of the GPU server that can be viewed through Cloud Monitoring.
Basic monitoring metrics are provided even without installing the Agent; see the first table below, GPU Server monitoring metrics (provided by default). Metrics that become available after installing the Agent are listed in the second table, Additional monitoring metrics for GPU Server (Agent installation required).
For detailed usage of Cloud Monitoring, refer to the Cloud Monitoring guide.
| Performance Item Name | Explanation | unit |
|---|---|---|
| Memory Total [Basic] | bytes of usable memory | bytes |
| Memory Used [Basic] | bytes of currently used memory | bytes |
| Memory Swap In [Basic] | bytes of memory swapped in | bytes |
| Memory Swap Out [Basic] | bytes of memory swapped out | bytes |
| Memory Free [Basic] | bytes of unused memory | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | write bytes | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | Average system CPU usage over 1 minute | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Incoming packet drop | cnt |
| Network In Packets [Basic] | Number of received packets | cnt |
| Network Out Bytes [Basic] | sent bytes | bytes |
| Network Out Dropped [Basic] | Transmit packet drop | cnt |
| Network Out Packets [Basic] | Number of transmitted packets | cnt |
| Performance Item Name | Explanation | unit |
|---|---|---|
| GPU Count | Number of GPUs | cnt |
| GPU Memory Usage | Memory usage rate | % |
| GPU Memory Used | Memory usage | MB |
| GPU Temperature | GPU temperature | ℃ |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU utilization (%) | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current GPU power usage | W |
| GPU Memory Usage [Avg] | Overall average GPU memory utilization (%) | % |
| GPU Count in use | Number of GPUs in use by jobs on the node | cnt |
| Execution Status for nvidia-smi | Result of running the nvidia-smi command | status |
| Core Usage [IO Wait] | Ratio of CPU time spent in wait state (disk wait) | % |
| Core Usage [System] | Proportion of CPU time spent in kernel space | % |
| Core Usage [User] | Proportion of CPU time spent in user space | % |
| CPU Cores | Number of CPU cores on the host | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding Idle and IOWait states | % |
| CPU Usage [Idle] | Proportion of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Proportion of CPU time spent in a waiting state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel | % |
| CPU Usage [User] | Percentage of CPU time used in user space. | % |
| CPU Usage/Core [Active] | Percentage of CPU time used other than Idle and IOWait states | % |
| CPU Usage/Core [Idle] | Proportion of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Proportion of CPU time spent in a waiting state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space. | % |
| Disk CPU Usage [IO Request] | Proportion of CPU time during which I/O requests to the device were executed | % |
| Disk Queue Size [Avg] | The average queue length of requests executed for the device. | num |
| Disk Read Bytes | The number of bytes read per second from the device. | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta for individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of the system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of the system.diskio.read.bytes value for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes successfully read. | bytes |
| Disk Read Requests | Number of read requests to the disk device per second | cnt |
| Disk Read Requests [Delta Avg] | Average of the system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Max] | Maximum system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of successful reads | cnt |
| Disk Request Size [Avg] | The average size of requests executed on the device (unit: sectors). | num |
| Disk Service Time [Avg] | Average service time (milliseconds) of input requests executed on the device. | ms |
| Disk Wait Time [Avg] | Average time taken for requests issued to the device to be served | ms |
| Disk Wait Time [Read] | Average disk read wait time | ms |
| Disk Wait Time [Write] | Average disk write wait time | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Max] | Maximum system.diskio.write.bytes_delta of individual disks | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta for individual disks | bytes |
| Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta for individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of the system.diskio.write.bytes value for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes successfully written. | bytes |
| Disk Write Requests | Number of write requests to the disk device per second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of the system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of successful writes | cnt |
| Disk Writes Bytes | The number of bytes written per second to the device. | bytes |
| Filesystem Hang Check | filesystem (local/NFS) hang check (normal:1, abnormal:0) | status |
| Filesystem Nodes | Total number of file nodes in the file system | cnt |
| Filesystem Nodes [Free] | Total number of available file nodes in the file system | cnt |
| Filesystem Size [Available] | Disk space (bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Available disk space (bytes) | bytes |
| Filesystem Size [Total] | Total disk space (bytes) | bytes |
| Filesystem Usage | Used disk space percentage | % |
| Filesystem Usage [Avg] | Average of individual filesystem.used.pct values | % |
| Filesystem Usage [Inode] | inode usage | % |
| Filesystem Usage [Max] | Maximum among individual filesystem.used.pct | % |
| Filesystem Usage [Min] | minimum among individual filesystem.used.pct | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used disk space (bytes) | bytes |
| Filesystem Used [Inode] | inode usage | bytes |
| Memory Free | Total amount of available memory (bytes). | bytes |
| Memory Free [Actual] | Actual usable memory (bytes). | bytes |
| Memory Free [Swap] | Available swap memory. | bytes |
| Memory Total | total memory | bytes |
| Memory Total [Swap] | Total swap memory. | bytes |
| Memory Usage | Percentage of used memory | % |
| Memory Usage [Actual] | Percentage of memory actually used | % |
| Memory Usage [Cache Swap] | cached swap usage rate | % |
| Memory Usage [Swap] | Percentage of used swap memory | % |
| Memory Used | used memory | bytes |
| Memory Used [Actual] | Actual memory used (bytes). | bytes |
| Memory Used [Swap] | Swap memory used. | bytes |
| Collisions | Network collision | cnt |
| Network In Bytes | Number of received bytes | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta for individual networks | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Min] | Minimum system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta for individual networks | bytes |
| Network In Bytes [Delta] | Delta of received byte count | bytes |
| Network In Dropped | Number of incoming packets that were dropped | cnt |
| Network In Errors | Number of errors during reception | cnt |
| Network In Packets | Number of received packets | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta for individual networks | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta for individual networks | cnt |
| Network In Packets [Delta] | Delta of received packet count | cnt |
| Network Out Bytes | Number of transmitted bytes | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta for individual networks | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta for individual networks | bytes |
| Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta for individual networks | bytes |
| Network Out Bytes [Delta] | Delta of transmitted byte count | bytes |
| Network Out Dropped | Number of outgoing packets that were dropped | cnt |
| Network Out Errors | Number of errors during transmission | cnt |
| Network Out Packets | Number of transmitted packets | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta for individual networks | cnt |
| Network Out Packets [Delta] | Delta of transmitted packet count | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Available port usage rate | % |
| SYN Sent Sockets | Number of sockets in SYN_SENT state (when connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update. | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event. | % |
| Process Memory Usage | Proportion of main memory (RAM) occupied by a process | % |
| Process Memory Used | Resident Set size. The amount of memory a process occupies in RAM. | bytes |
| Process PID | process pid | PID |
| Process PPID | parent process PID | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | stopped processes count | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes with an unsearchable or unknown status | cnt |
| Processes [Zombie] | Zombie processes count | cnt |
| Running Process Usage | process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Total number of threads running in running processes | cnt |
| Context Switches | context switch count (per second) | cnt |
| Load/Core [1 min] | The load over the last 1 minute divided by the number of cores | cnt |
| Load/Core [15 min] | The load over the last 15 minutes divided by the number of cores | cnt |
| Load/Core [5 min] | The load over the last 5 minutes divided by the number of cores | cnt |
| Multipaths [Active] | External storage connection path status = active count | cnt |
| Multipaths [Failed] | External storage connection path status = failed count | cnt |
| Multipaths [Faulty] | External storage connection path status = faulty count | cnt |
| NTP Offset | measured offset of the last sample (the time difference between the NTP server and the local environment) | num |
| Run Queue Length | Execution queue length | num |
| Uptime | OS uptime (milliseconds). | ms |
| Context Switches | CPU context switch count (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from a Windows logical disk in 1 second | cnt |
| Disk Read Time [Avg] | Average data read time (seconds) | sec |
| Disk Transfer Time [Avg] | Disk average wait time | sec |
| Disk Usage | Disk usage | % |
| Disk Write Bytes [Sec] | Number of bytes written in one second on a Windows logical disk | cnt |
| Disk Write Time [Avg] | Average data write time (seconds) | sec |
| Pagingfile Usage | Paging file usage | % |
| Pool Used [Non Paged] | Nonpaged Pool usage in kernel memory | bytes |
| Pool Used [Paged] | Paged Pool usage in kernel memory | bytes |
| Process [Running] | Number of currently running processes | cnt |
| Threads [Running] | Number of currently running threads | cnt |
| Threads [Waiting] | Number of threads waiting for processor time | cnt |
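Several GPU metrics in the table above (GPU Count, GPU Usage, GPU Memory Used, GPU Temperature, GPU Power Usage, GPU Power Cap) correspond to values that the `nvidia-smi` command can report. How the Agent actually collects them is not described in this guide. The following is only a sketch that parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and derives per-server averages. The sample text is made-up illustration data, and `summarize` is a hypothetical helper.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in column order.
QUERY_FIELDS = [
    "utilization.gpu",  # %
    "memory.used",      # MiB
    "memory.total",     # MiB
    "temperature.gpu",  # degrees Celsius
    "power.draw",       # W
    "power.limit",      # W
]


def query_nvidia_smi():
    """Run nvidia-smi (requires an installed NVIDIA driver) and return its CSV output."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def summarize(csv_text):
    """Compute GPU count and average utilization figures from one CSV row per GPU."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [[float(value) for value in row] for row in reader if row]
    gpu_count = len(rows)
    return {
        "gpu_count": gpu_count,
        "gpu_usage_avg": sum(r[0] for r in rows) / gpu_count,
        "gpu_memory_usage_avg": sum(100 * r[1] / r[2] for r in rows) / gpu_count,
        "gpu_power_usage_total": sum(r[4] for r in rows),
    }


# Made-up sample output for two GPUs (not measured data).
SAMPLE = "50, 20000, 80000, 40, 300.0, 700.0\n100, 40000, 80000, 60, 600.0, 700.0\n"
print(summarize(SAMPLE))
```

On a GPU Server, `summarize(query_nvidia_smi())` would apply the same calculation to live values. Fields that a GPU reports as unavailable would need extra handling before the `float` conversion.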
1.3 - ServiceWatch Metrics
The GPU Server sends metrics to ServiceWatch. The metrics provided by default monitoring are data collected at 5‑minute intervals. If detailed monitoring is enabled, you can view data collected at 1‑minute intervals.
- Basic and detailed monitoring for GPU Server provide the same metrics as Virtual Server, and the metrics are published under the Virtual Server namespace.
- GPU-related metrics are provided through ServiceWatch Agent, and for instructions on collecting metrics using ServiceWatch Agent, refer to the ServiceWatch Agent guide.
Refer to How-to guides > ServiceWatch Detailed Monitoring Activation for how to enable detailed monitoring of the GPU Server.
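To give a sense of the difference between the two collection intervals, the sketch below counts the data points produced per metric per day and shows how 1-minute samples could be reduced to 5-minute values. Averaging is an assumption made for illustration; this guide does not state how ServiceWatch aggregates data.

```python
from statistics import mean


def datapoints_per_day(interval_minutes):
    """Number of samples collected per metric per day at the given interval."""
    return 24 * 60 // interval_minutes


def to_five_minute_averages(one_minute_values):
    """Group 1-minute samples into blocks of five and average each block (assumed aggregation)."""
    return [
        mean(one_minute_values[i:i + 5])
        for i in range(0, len(one_minute_values), 5)
    ]


print(datapoints_per_day(5))  # 288
print(datapoints_per_day(1))  # 1440
print(to_five_minute_averages([10, 20, 30, 40, 50, 60, 60, 60, 60, 60]))
```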
Basic Metrics
The following are the basic metrics for the Virtual Server namespace.
Metrics whose names are shown in bold below are the key metrics selected from the basic metrics provided by Virtual Server. Key metrics are used to build the service dashboards that ServiceWatch creates automatically for each service.
For each metric, the user guide indicates which statistics are meaningful to view. Among these, the statistics shown in bold are the primary statistics. In the service dashboard, you can view the key metrics using their primary statistics.
| Performance items | Detailed description | Unit | Meaningful statistics |
|---|---|---|---|
| Instance State | Instance status display | None | |
| CPU Usage | CPU usage | Percent | |
| Disk Read Bytes | Bytes read from block device (bytes) | Bytes | |
| Disk Read Requests | Number of read requests on a block device | Count | |
| Disk Write Bytes | Write capacity (bytes) on block device | Bytes | |
| Disk Write Requests | Number of write requests on block device | Count | |
| Network In Bytes | Received bytes (capacity) on the network interface | Bytes | |
| Network In Dropped | Number of packet drops received on the network interface | Count | |
| Network In Packets | Number of packets received on the network interface | Count | |
| Network Out Bytes | Data transmitted on the network interface (bytes) | Bytes | |
| Network Out Dropped | Number of packet drops transmitted from the network interface | Count | |
| Network Out Packets | Number of packets transmitted on the network interface | Count | |
2 - How-to guides
Users can create the service by entering the required GPU Server information and selecting detailed options through the Samsung Cloud Platform Console.
Creating a GPU Server
You can create and use a GPU Server service from the Samsung Cloud Platform Console.
To create a GPU server, follow the steps below.
Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server Creation button. You will be taken to the GPU Server Creation page.
On the GPU Server Creation page, enter the information required to create the service and select detailed options.
- Select the required information in the Image and Version Selection area.
| Category | Required status | Detailed description |
|---|---|---|
| Image | Required | Select the type of Image provided<br>- Standard: Samsung Cloud Platform standard provided Image<br>- RHEL, Ubuntu<br>- Custom: User-created Image<br>- Kubernetes: Image for Kubernetes<br>- Ubuntu |
| Image version | Required | Select the version of the chosen Image<br>- Provides a list of versions for the offered server Image<br>- For detailed information about the provided server image, see OS and GPU driver versions |
Table. GPU Server image and version selection input fields
- In the Service Information Input area, enter or select the required information.
| Category | Required status | Detailed description |
|---|---|---|
| Number of servers | Required | Number of GPU Servers to create concurrently<br>- Only numeric input is allowed; enter a value between 1 and 100 |
| Service Type > Server Type | Required | GPU Server server type<br>- Indicates the specifications of a GPU-type server, allowing selection of servers with 1, 2, 4, or 8 GPUs<br>- For detailed information about the server types provided by GPU Server, refer to GPU Server Server Types |
| Service Type > Planned Compute | Optional | Resource status with Planned Compute configured<br>- In Use: Number of resources with Planned Compute that are currently in use<br>- Configured: Number of resources with Planned Compute configured<br>- Coverage Preview: Amount applied per resource by Planned Compute<br>- Apply for Planned Compute Service: Navigate to the Planned Compute service application page<br>- For more details, refer to Apply for Planned Compute |
| Block Storage | Required | Configure the Block Storage used by the GPU Server according to its purpose<br>- Basic: The area where the OS is installed and used<br>- Capacity can be entered in units (the minimum capacity varies depending on the OS image type)<br>- RHEL: Values between 3 and 1,536 can be entered<br>- Ubuntu: Values between 3 and 1,536 can be entered<br>- SSD: High-performance general volume<br>- HDD: General volume<br>- SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys<br>- Encryption can be applied only at initial creation (cannot be changed after creation)<br>- Performance degradation may occur when using the SSD_KMS disk type<br>- SSD_Provisioned: SSD volume with configurable IOPS and throughput<br>- Additional: Used when the user needs extra space beyond the OS area<br>- After selecting Use, enter the storage type and capacity<br>- To add storage, click the + button (up to 25 can be added); to delete, click the x button<br>- Capacity can be entered in units, with values between 1 and 1,536<br>- Since 1 Unit equals 8 GB, this creates 8 GB to 12,288 GB<br>- SSD: High-performance general volume<br>- HDD: General volume<br>- SSD/HDD_KMS: Additional encrypted volume using Samsung Cloud Platform KMS (Key Management System) encryption keys<br>- Encryption can be applied only at initial creation (cannot be changed after creation)<br>- Performance degradation may occur when using the SSD_KMS disk type<br>- HDD/SSD_MultiAttach: Volume that can be attached to two or more servers<br>- SSD_Provisioned: SSD volume with configurable IOPS and throughput<br>- For details on each Block Storage type, refer to Create Block Storage<br>- Delete on termination: If Delete on Termination is set to Enabled, the volume is terminated together with the server<br>- Volumes with existing snapshots are not deleted even if Delete on termination is set to Enabled<br>- A multi-attach volume can be deleted only when the server being removed is the last remaining server attached to the volume |
| Max IOPS | Required | Enter a maximum IOPS value between 5,000 and 20,000<br>- Can be set only when the disk type is SSD_Provisioned |
| Max Throughput | Required | Enter a maximum Throughput value between 250 and 1,000<br>- Can be set only when the disk type is SSD_Provisioned |
Table. GPU Server Service Configuration Items
- In the Required Information Input area, enter or select the necessary information.
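The Block Storage sizing and provisioned-performance limits listed in the Service Information fields above (1 Unit = 8 GB, 1 to 1,536 units, Max IOPS 5,000 to 20,000, Max Throughput 250 to 1,000) can be checked before filling in the form. The following is a minimal sketch with hypothetical helper names. It assumes the 8 GB unit size also applies to the OS volume, which the guide states only for additional volumes.

```python
UNIT_GB = 8          # 1 Unit equals 8 GB
MAX_UNITS = 1536     # upper bound listed for OS and additional volumes
OS_MIN_UNITS = 3     # lower bound listed for RHEL and Ubuntu OS volumes
EXTRA_MIN_UNITS = 1  # lower bound listed for additional volumes


def units_to_gb(units, min_units=EXTRA_MIN_UNITS):
    """Convert a capacity in units to GB after checking the allowed range."""
    if not min_units <= units <= MAX_UNITS:
        raise ValueError(f"units must be between {min_units} and {MAX_UNITS}, got {units}")
    return units * UNIT_GB


def check_provisioned(max_iops, max_throughput):
    """Validate SSD_Provisioned settings against the ranges in the guide."""
    if not 5_000 <= max_iops <= 20_000:
        raise ValueError(f"Max IOPS must be between 5,000 and 20,000, got {max_iops}")
    if not 250 <= max_throughput <= 1_000:
        raise ValueError(f"Max Throughput must be between 250 and 1,000, got {max_throughput}")


print(units_to_gb(1), units_to_gb(MAX_UNITS))             # 8 12288
print(units_to_gb(OS_MIN_UNITS, min_units=OS_MIN_UNITS))  # 24
check_provisioned(max_iops=5_000, max_throughput=250)     # passes silently
```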
Category required or notDetailed description Server name Required Enter a name to distinguish the server when the selected number of servers is 1 - Set the hostname using the entered server name
- Enter within 63 characters using letters, numbers, spaces, and special characters (
-_)
Server name Prefix Required Enter a Prefix to distinguish each server generated when the selected number of servers is 2 or more - Automatically generated in the form of user input value (prefix) + ‘
-#’
- Enter within 59 characters using letters, numbers, spaces, and special characters (
-,_)
Network Settings > Create New Network Port (Required): Configure the network where the GPU Server will be installed
- Select a pre-created VPC
- General Subnet: Select a pre-created general Subnet
- The IP can be generated automatically or entered manually; if Manual is selected, enter the IP directly
- NAT: Available only when a single server is being created and the VPC is connected to an Internet Gateway. Checking Use lets you select a NAT IP
- NAT IP: Select a NAT IP
- If no NAT IP is available, click the Create New button to create a Public IP
- Click the Refresh button to view and select the created Public IP
- Creating a Public IP incurs charges according to the Public IP pricing policy
- Local Subnet (optional): Select Use to use a local Subnet
- A local Subnet is not required to create the service
- A pre-created local Subnet must be selected
- The IP can be generated automatically or entered manually; if Manual is selected, enter the IP directly
- Security Group: Settings required to access the server
- Select: Choose a pre-created Security Group
- Create New: If no applicable Security Group exists, create one in the Security Group service
- Up to 5 can be selected
- If no Security Group is set, all connections are blocked by default
- Configure a Security Group to allow the required connections
Network Settings > Existing Network Port Assignment (Required): Configure the network where the GPU Server will be installed
- Select a pre-created VPC
- General Subnet: Select a pre-created general Subnet and Port
- NAT: Available only when a single server is being created and the VPC is connected to an Internet Gateway. Checking Use lets you select a NAT IP
- NAT IP: Select a NAT IP
- If no NAT IP is available, click the Create New button to create a Public IP
- Click the Refresh button to view and select the created Public IP
- Local Subnet (optional): Select Use to use a local Subnet
- Select a pre-created local Subnet and Port
Keypair (Required): User authentication method to use when connecting to the server
- Create New: Create a new Keypair if one is needed
- For how to create a new Keypair, refer to Create a Keypair
- Default login accounts by OS
- RHEL: cloud-user
- Ubuntu: ubuntu
Table. Required input fields for GPU Server
- Enter or select the required information in the Additional Information Input area.
Category / Required / Detailed description
Lock (Optional): Whether to use Lock
- When Lock is enabled, actions such as server termination, start, and stop are blocked, which prevents malfunctions caused by mistakes
Init script (Optional): Script executed when the server starts
- The init script must be written as a Batch script for Windows, a Shell script for Linux, or cloud-init, depending on the image type
- Up to 45,000 bytes can be entered
Tag (Optional): Add tags
- Up to 50 tags can be added per resource
- Click the Add Tag button, then enter or select the Key and Value
Table. GPU Server additional information input fields
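The input limits listed in the tables above can be checked before the form is submitted. The following is a minimal sketch only: it is not part of the console or of any Samsung Cloud Platform API, all function names are hypothetical, "letters" is assumed to mean ASCII letters, the allowed special characters are assumed to be the hyphen and underscore, and the init script size is assumed to be measured on its UTF-8 encoding.

```python
import re

# Letters, digits, spaces, hyphen, and underscore (ASCII only: an assumption).
_NAME_CHARS = re.compile(r"[A-Za-z0-9 _-]+")


def check_server_name(name: str, server_count: int = 1) -> list[str]:
    """Return a list of problems with a server name (1 server) or prefix (2 or more)."""
    limit = 63 if server_count == 1 else 59  # limits stated in the table
    problems = []
    if not name:
        problems.append("name is required")
    elif not _NAME_CHARS.fullmatch(name):
        problems.append("only letters, numbers, spaces, '-' and '_' are allowed")
    if len(name) > limit:
        problems.append(f"name exceeds {limit} characters")
    return problems


def check_provisioned_ssd(max_iops: int, max_throughput: int) -> list[str]:
    """Check the SSD_Provisioned ranges: IOPS 5,000 to 20,000, throughput 250 to 1,000."""
    problems = []
    if not 5_000 <= max_iops <= 20_000:
        problems.append("Max IOPS must be between 5,000 and 20,000")
    if not 250 <= max_throughput <= 1_000:
        problems.append("Max Throughput must be between 250 and 1,000")
    return problems


def check_init_script(script: str) -> list[str]:
    """Check the 45,000-byte limit on the UTF-8 encoded script (encoding is an assumption)."""
    size = len(script.encode("utf-8"))
    return [] if size <= 45_000 else [f"init script is {size} bytes; the limit is 45,000"]


def check_tags(tags: dict[str, str]) -> list[str]:
    """Check the limit of 50 tags per resource."""
    return [] if len(tags) <= 50 else ["at most 50 tags can be added per resource"]
```

Each function returns an empty list when the value is acceptable, so the results can be concatenated and shown to the user in one pass.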
- Select the required information in the Image and Version Selection area.
- In the Summary panel, check the detailed information and estimated billing amount, and click the Create button.
- Once creation is complete, check the created resources on the GPU Server List page.
Check GPU Server detailed information
The GPU Server service lets you view the full resource list and view or edit detailed information. The GPU Server Details page consists of the Details, Tags, and Job History tabs.
To view detailed information about the GPU Server service, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource you want to view details for. You will be taken to the GPU Server Details page.
- The GPU Server Details page displays status information and additional features, and consists of the Details, Tags, and Job History tabs.
- For details on the additional features, refer to GPU Server Management Additional Features.
Category / Detailed description
GPU Server status: Status of the GPU Server created by the user
- Build: The Build command has been received
- Building: Build in progress
- Networking: Server creation networking process
- Scheduling: Server creation scheduling process
- Block_Device_Mapping: Connecting Block Storage during server creation
- Spawning: Server creation process is ongoing
- Active: Available state
- Powering_off: State when a stop request is made
- Deleting: Server deletion in progress
- Reboot_Started: Reboot in progress
- Error: Error state
- Migrating: Server migrating to another host
- Reboot: State where the Reboot command has been received
- Rebooting: Rebooting
- Rebuild: State where the Rebuild command has been received
- Rebuilding: State during Rebuild request
- Rebuild_Spawning: Rebuild process is ongoing
- Resize: State where the Resize command has been received
- Resizing: Resizing in progress
- Resize_Prep: State when a server type change is requested
- Resize_Migrating: Server is migrating to another host while resizing
- Resize_Migrated: Server has completed migration to another host during resizing
- Resize_Finish: Resize completed
- Revert_Resize: Resize or migration of the server failed for some reason. The target server is cleaned up and the original server is restarted
- Shutoff: State when powering off is completed
- Verify_Resize: After a server type change request has passed Resize_Prep, the new server type can be confirmed or reverted
- Resize_Reverting: State when a server type revert is requested
- Resize_Confirming: State confirming the server’s Resize request
Server control: Buttons to change the server status
- Start: Start a stopped server
- Stop: stop a running server
- Restart: restart a running server
Image generation: Create a custom user image from the current server image
Console log: View the console logs of the current server
- You can view the console logs output by the current server. For more details, see Check console log.
Create dump: Generate a dump of the current server
- The dump file is created inside the GPU Server
- For detailed dump creation instructions, refer to Create Dump
Rebuild: All data and settings of the existing server are deleted, and a new server is configured
- For detailed information, see Perform Rebuild.
Service termination: Cancel Service button
Table. GPU Server status information and additional features
Detailed Information
On the GPU Server List page, you can view detailed information about the selected resource and edit it if needed.
| Category | Detailed description |
|---|---|
| Service | Service name |
| Resource Type | Resource type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource name | Resource name |
| Resource ID | Unique resource ID in the service |
| Creator | User who created the service |
| Creation date and time | Service creation date and time |
| Editor | User who edited the service information |
| Modification date | Date and time the service information was modified |
| Server name | Server name |
| Server type | vCPU, memory, and GPU information |
| Image name | Service OS image and version |
| Lock | Whether Lock is in use |
| Keypair name | Server authentication information set by the user |
| Planned Compute | Resource status with Planned Compute configured |
| LLM Endpoint | URL for using the LLM |
| ServiceWatch Detailed Monitoring | When enabled, data monitoring is possible in the ServiceWatch service |
| Network | Network information of the GPU Server |
| Local Subnet | Local Subnet information of the GPU Server |
| Block Storage | Information about the Block Storage attached to the server |
Tags
On the GPU Server List page, you can view the tag information of the selected resource and add, modify, or delete tags.
| Category | Detailed description |
|---|---|
| Tag list | Tag list |
Job History
You can view the job history of the selected resource on the GPU Server List page.
| Category | Detailed description |
|---|---|
| Task History List | Resource change history |
Control GPU Server Operation
If you need to control the operation of a created GPU Server resource, you can do so on the GPU Server List or GPU Server Details page. You can start, stop, and restart a server.
Start GPU Server
You can start a stopped (Shutoff) GPU Server. To start a GPU Server, follow the steps below.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of the GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource you want to start among the stopped (Shutoff) servers. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can also Start each resource using the More button on the right.
- After selecting multiple servers with checkboxes, you can control them simultaneously using the Start button at the top.
- On the GPU Server Details page, click the Start button at the top to start the server. Check the changed server status in the Status Display item.
- When the GPU Server startup is complete, the server status changes from Shutoff to Active.
- For detailed information about the GPU Server status, refer to Check GPU Server detailed information.
Stopping the GPU Server
You can stop a running (Active) GPU Server. To stop the GPU Server, follow the steps below.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to stop among the running (Active) servers. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can also Stop each resource using the More button on the right.
- After selecting multiple servers with checkboxes, you can control them simultaneously using the Stop button at the top.
- On the GPU Server Details page, click the Stop button at the top to stop the server. Check the changed server status in the Status Display item.
- When the GPU Server shutdown is complete, the server status changes from Active to Shutoff.
- For detailed information about the GPU Server status, refer to Check GPU Server detailed information.
Restart GPU Server
You can restart the created GPU Server. To restart the GPU Server, follow the steps below.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to restart. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can also Restart each resource using the More button on the right.
- After selecting multiple servers with checkboxes, you can control them simultaneously using the Restart button at the top.
- On the GPU Server Details page, click the Restart button at the top to restart the server. Check the changed server status in the Status Display item.
- During a GPU Server restart, the server status goes through Rebooting and finally changes to Active.
- For detailed information about the GPU Server status, refer to Check GPU Server detailed information.
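The status changes described for Start, Stop, and Restart can be waited on with a simple polling loop. The sketch below is illustrative and makes several assumptions: `get_status` is a hypothetical callable supplied by the caller (this guide does not define an API or CLI for reading the status), and the timeout and interval values are arbitrary defaults.

```python
import time
from typing import Callable

# Expected (status before, status after) for each control action, as described in this guide.
TRANSITIONS = {
    "Start": ("Shutoff", "Active"),
    "Stop": ("Active", "Shutoff"),
    "Restart": ("Active", "Active"),  # passes through Rebooting on the way
}


def wait_for_status(
    get_status: Callable[[], str],
    target: str,
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns target; fail on Error or on timeout.

    get_status is a hypothetical, caller-supplied function that returns the
    current server status string (for example "Active" or "Shutoff").
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return status
        if status == "Error":
            raise RuntimeError("server entered the Error state")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

For Restart, the status before and after the action are both Active, so a caller would first need to wait until the status leaves Active (for example, until Rebooting is observed) before waiting for Active again.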
GPU Server resource management
If you need server control and management functions for created GPU Server resources, you can perform the tasks on the GPU Server List or GPU Server Details page.
Create Image
You can create a custom user image from the image of a running GPU Server.
- On the GPU Server List or GPU Server Details page, click the Create Image button to create a user Custom Image.
To create an image of the GPU Server, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to create an Image from. You will be taken to the GPU Server Details page.
- Click the Create Image button on the GPU Server Details page. You will be taken to the Image Creation page.
- Enter the required information in the Service Information Input area.
Category / Required / Detailed description
Image name (Required): Enter the name of the image to be created
- Enter up to 200 characters using English letters, numbers, spaces, and the special characters (-_)
Table. Image service information input fields
- Check the input information and click the Create button.
- When creation is complete, check the created resources on the All Services > Compute > GPU Server > Image List page.
- When you create an Image, the generated Image is stored in the Object Storage used as internal storage. Therefore, Object Storage usage fees are applied.
- Since the file system of an image created from an active GPU server cannot be guaranteed to be intact, it is recommended to stop the server before creating the image.
Enabling detailed monitoring for ServiceWatch
By default, GPU Server is integrated with ServiceWatch through basic monitoring in the Virtual Server namespace. You can enable detailed monitoring as needed to identify operational issues more quickly and take action. For detailed information about ServiceWatch, see ServiceWatch Overview.
To enable detailed ServiceWatch monitoring on the GPU Server, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to enable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
- On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
- In the ServiceWatch Detailed Monitoring Edit popup, select Enable, review the guidance message, and click the Confirm button.
- On the GPU Server Details page, check the ServiceWatch detailed monitoring items.
Disable detailed monitoring of ServiceWatch
To disable detailed monitoring of ServiceWatch on the GPU Server, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource for which to disable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
- On the GPU Server Details page, click the ServiceWatch detailed monitoring Edit button. You will be taken to the ServiceWatch Detailed Monitoring Edit popup.
- In the ServiceWatch Detailed Monitoring Edit popup, deselect Enable, review the guidance message, and click the Confirm button.
- On the GPU Server Details page, view the ServiceWatch detailed monitoring items.
GPU Server Management Additional Features
For server management, you can view console logs, create dumps, and perform a Rebuild on a GPU Server. Follow the steps below for each task.
Check console log
You can view the current console log of the GPU Server.
To view the console logs of the GPU server, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the GPU Server’s Service Home page.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to view the console log. You will be taken to the GPU Server Details page.
- On the GPU Server Details page, click the Console Log button. You will be taken to the Console Log popup.
- Check the console log displayed in the Console Log popup.
Create Dump
To create a dump file on the GPU Server, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to create a dump for. You will be taken to the GPU Server Details page.
- On the GPU Server Details page, click the Create Dump button.
- The dump file is created inside the GPU server.
Perform Rebuild
You can delete all data and settings of the existing GPU Server and rebuild it on a new server.
Follow these steps to perform a Rebuild of the GPU Server.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, click the resource to rebuild. You will be taken to the GPU Server Details page.
- On the GPU Server Details page, click the Rebuild button.
- During a GPU Server Rebuild, the server status changes to Rebuilding, and when the Rebuild is complete, it returns to its state before the Rebuild.
- For detailed information about the GPU Server status, refer to Check GPU Server details.
Terminate GPU Server
If you terminate an unused GPU Server, you can reduce operating costs. However, terminating a GPU Server may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with the termination.
To terminate the GPU Server, follow the steps below.
- Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
- On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
- On the GPU Server List page, select the resource to terminate and click the Cancel Service button.
- Whether attached storage is deleted depends on the Delete on termination setting; refer to Termination constraints.
- After termination is complete, check on the GPU Server List page whether the resources have been terminated.
Termination constraints
If a GPU Server termination request cannot be processed, a popup explains the reason. Refer to the cases below.
- If File Storage is connected, disconnect the File Storage first.
- If an LB Pool is connected, disconnect the LB Pool first.
- If Lock is set, change the Lock setting to disabled and try again.
Whether attached storage is deleted depends on the Delete on termination setting.
- If Delete on termination is not set: The volume is not deleted even if you terminate the GPU Server.
- If Delete on termination is set: The volume is deleted when you terminate the GPU Server.
- Volumes that have snapshots are not deleted even if Delete on termination is set.
- A Multi-attach volume can be deleted only when the server being deleted is the last remaining server attached to the volume.
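The termination rules above can be written down as two small checks. This is a sketch of the documented rules only, not the platform's implementation; the `Volume` class, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical description of a volume attached to the server being terminated."""
    name: str
    delete_on_termination: bool
    has_snapshots: bool = False
    multi_attach: bool = False
    attached_server_count: int = 1  # includes the server being terminated


def termination_blockers(file_storage_connected: bool,
                         lb_pool_connected: bool,
                         locked: bool) -> list[str]:
    """List the actions required before the termination request can be processed."""
    blockers = []
    if file_storage_connected:
        blockers.append("disconnect File Storage first")
    if lb_pool_connected:
        blockers.append("disconnect the LB Pool first")
    if locked:
        blockers.append("disable Lock first")
    return blockers


def is_deleted_with_server(vol: Volume) -> bool:
    """Apply the Delete on termination rules to one attached volume."""
    if not vol.delete_on_termination:
        return False  # setting not enabled: the volume is kept
    if vol.has_snapshots:
        return False  # volumes with snapshots are never deleted with the server
    if vol.multi_attach and vol.attached_server_count > 1:
        return False  # other servers are still attached to the volume
    return True
```

For example, a multi-attach volume with Delete on termination enabled is kept while two servers are attached, and is deleted only when the last attached server is terminated.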
2.1 - Manage Image
Users can create the service by entering the required information for the Image service within the GPU Server service and selecting detailed options through the Samsung Cloud Platform Console.
Create Image
You can create an image of a running GPU Server. To create an image of a GPU Server, refer to Create Image.
Check Image detailed information
The Image service lets you view the full resource list and view or edit detailed information. The Image Details page consists of the Details, Tags, and Job History tabs.
To view detailed information of the Image service, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the Image menu. You will be taken to the Image list page.
- On the Image List page, click the resource to view detailed information. You will be taken to the Image Detail page.
- The Image Details page displays status information and additional features, and consists of the Details, Tags, and Job History tabs.
Category / Detailed description
Image status: Status of the Image created by the user
- Active: Available
- Queued: The Image has been created and uploaded and is waiting to be processed
- Importing: The Image has been created and uploaded and is being processed
Share with another account: Share the Image with another Account
- The Image’s Visibility must be set to Shared in order to share it with another Account
Delete image: Button to delete the Image
- Once an Image is deleted, it cannot be restored
Table. GPU Server Image status information and additional features
Detailed Information
On the Image List page, you can view detailed information about the selected resource and modify it if necessary.
| Category | Detailed description |
|---|---|
| Service | Service name |
| Resource Type | Resource type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource name | Image name |
| Resource ID | Image ID |
| Creator | User who created the Image |
| Creation date and time | Image creation timestamp |
| Editor | User who edited the Image |
| Modification date | Image modification timestamp |
| Image name | Image name |
| Minimum disk | Minimum disk capacity (GB) of the Image |
| Minimum RAM | Minimum RAM size (GB) of the Image |
| OS type | OS type of the Image |
| OS hash algorithm | OS hash algorithm method |
| Visibility | Access permission of the Image |
| Protected | Whether deletion of the Image is prohibited |
| Image file URL | URL of the Image file uploaded when the Image was created |
| Sharing status | Current status of sharing the Image with another Account |
Tags
On the Image List page, you can view the tag information of the selected resource and add, modify, or delete tags.
| Category | Detailed description |
|---|---|
| Tag list | Tag list |
Job History
You can view the job history of the selected resource on the Image List page.
| Category | Detailed description |
|---|---|
| Task History List | Resource change history |
Image Resource Management
This section describes the control and management functions for created Images.
Share to another Account
To share an Image with another Account, follow the steps below.
- Log in to the Account that owns the Image to share and click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
- On the Service Home page, click the Image menu. You will be taken to the Image List page.
- On the Image List page, click the Image you want to control. You will be taken to the Image Details page.
- Click the Share to another Account button. You will be taken to the Share image to another Account page.
- To share an Image with another Account, the Image’s Visibility must be Shared.
- On the Share image to another Account page, enter the required information and click the Done button.
Category / Required / Detailed description
Image name (-): Name of the image to share
- Input not allowed
Image ID (-): ID of the image to share
- Input not allowed
Shared Account ID (Required): Enter the ID of the Account to share with
- Enter up to 64 characters using English letters, numbers, and the special character (-)
Table. Required input fields for sharing images to another Account
- You can check the result in the Sharing status item on the Image Details page.
- When the request is first made, the status is Pending, and it changes to Accepted once the Account receiving the share approves it.
Receive sharing from another Account
To receive an Image shared from another Account, follow these steps.
- Log in to the Account that will receive the shared Image and click the All Services > Compute > GPU Server menu. Navigate to the GPU Server’s Service Home page.
- On the Service Home page, click the Image menu. You will be taken to the Image List page.
- On the Image List page, click the Receive Image Share button. You will be taken to the Receive Image Share popup.
- In the Receive Image Share popup, enter the resource ID of the Image you want to receive and click the Confirm button.
- When image sharing is complete, you can view the shared Image in the Image list.
Delete Image
You can delete unused Images. However, since an Image cannot be recovered after deletion, you should carefully consider the impact before performing the deletion.
To delete the Image, follow these steps.
- Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
- On the Service Home page, click the Image menu. You will be taken to the Image List page.
- On the Image List page, select the resource to delete and click the Delete button.
- To delete several Images at once, select multiple Image check boxes on the Image List page and click the Delete button at the top of the resource list.
- After deletion is complete, verify on the Image list page that the resource has been removed.
2.2 - Manage Keypair
Users can create the service by entering the required Keypair information within the GPU Server service and selecting detailed options through the Samsung Cloud Platform Console.
Create a Keypair
You can create and use the Keypair service while using the GPU Server service in the Samsung Cloud Platform Console.
To create a keypair, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
- On the Keypair List page, click the Keypair Create button. You will be taken to the Keypair Create page.
- Enter the required information in the Service Information Input area.
Category / Required / Detailed description
Keypair name (Required): Enter the name of the Keypair to create
- Enter up to 255 characters using English letters, numbers, spaces, and the special characters (-,_)
Keypair type (Required): ssh
Table. Keypair service information input fields
- Enter or select the required information in the Additional Information Input area.
Category / Required / Detailed description
Tag (Optional): Add tags
- Up to 50 tags can be added per resource
- Click the Add Tag button, then enter or select the Key and Value
Table. Keypair additional information input fields
Caution
- After creation is complete, you can download the Key only once. It cannot be reissued, so make sure it has been downloaded.
- Store the downloaded Private Key in a safe place.
- Check the input information and click the Create button.
- After creation is complete, check the created resources on the Keypair List page.
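Once the private key has been downloaded, it is usually protected with owner-only file permissions, because OpenSSH on Linux and macOS refuses private key files that other users can read. The sketch below shows one way to do this and to build the matching `ssh` command. The helper names, the key file name, and the example address are hypothetical; the default login accounts are the ones listed in this guide.

```python
import os
import stat
from pathlib import Path

# Default login accounts by OS, as listed in this guide.
DEFAULT_LOGIN = {"RHEL": "cloud-user", "Ubuntu": "ubuntu"}


def protect_private_key(path) -> Path:
    """Restrict the key file to read/write by its owner only (mode 0600)."""
    key = Path(path)
    os.chmod(key, stat.S_IRUSR | stat.S_IWUSR)
    return key


def ssh_command(key_path, os_name: str, host: str) -> list[str]:
    """Build an ssh argument list that logs in with the downloaded private key."""
    user = DEFAULT_LOGIN[os_name]
    return ["ssh", "-i", str(key_path), f"{user}@{host}"]
```

For example, `ssh_command("my-key.pem", "Ubuntu", "192.0.2.10")` returns `["ssh", "-i", "my-key.pem", "ubuntu@192.0.2.10"]`, which can be passed to `subprocess.run`. The address must be reachable through the NAT IP and Security Group settings chosen when the server was created.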
View detailed information of the Keypair
The Keypair service lets you view the full resource list and view or edit detailed information. The Keypair Details page consists of the Details, Tags, and Job History tabs.
To view detailed information about a keypair, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
- On the Keypair List page, click the resource to view its details. You will be taken to the Keypair Details page.
- The Keypair Details page displays status information and additional features, and consists of the Details, Tags, and Job History tabs.
Detailed Information
On the Keypair List page, you can view detailed information about the selected resource and edit it if needed.
| Category | Detailed description |
|---|---|
| Service | Service name |
| Resource Type | Resource type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource name | Keypair name |
| Resource ID | Keypair’s unique resource ID |
| Creator | User who created the Keypair |
| Creation date and time | Keypair creation timestamp |
| Editor | User who modified the Keypair information |
| Modification date | Date and time the Keypair information was modified |
| Keypair name | Keypair name |
| Fingerprint | A unique value for identifying the key |
| User ID | User ID of the Keypair creator |
| Public key | Public key information |
Tags
On the Keypair List page, you can view the tag information of the selected resource and add, modify, or delete tags.
| Category | Detailed description |
|---|---|
| Tag list | Tag list |
Job History
On the Keypair List page, you can view the job history of the selected resource.
| Category | Detailed description |
|---|---|
| Task History List | Resource change history |
Keypair Resource Management
This section describes the control and management functions for Keypairs.
Import public key
To import a public key, follow these steps.
- Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
- On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
- On the Keypair List page, click the More button at the top and then click the Import Public Key button. You will be taken to the Import Public Key page.
- Enter or select the required information in the Required Information Input area.
Category / Required / Detailed description
Keypair name (Required): Name of the Keypair to create
Keypair type (Required): ssh
Public key (Required): Enter the public key
- Load file: Click the Attach file button to attach the public key file
- Only files with the .pem extension can be attached
- Enter public key: Paste the copied public key value
- The public key value can be copied from the Keypair Details page
Table. Required input fields for importing the public key
- Review the entered information and click the Complete button.
- Once creation is complete, check the created resources on the Keypair List page.
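Before importing a public key, it can help to confirm that the file has the accepted extension and that the key text is well formed. The sketch below assumes the key is in the one-line OpenSSH format (`<type> <base64-data> [comment]`) and computes the two fingerprint forms that `ssh-keygen -l` can print. Whether the Fingerprint value shown on the Keypair Details page uses either of these forms is not stated in this guide, and both function names are hypothetical.

```python
import base64
import hashlib


def has_pem_extension(filename: str) -> bool:
    """The import form accepts only files with the .pem extension."""
    return filename.lower().endswith(".pem")


def openssh_fingerprints(public_key_line: str) -> dict[str, str]:
    """Compute MD5 and SHA256 fingerprints of a one-line OpenSSH public key."""
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<type> <base64-data> [comment]'")
    # The second field is the base64-encoded key blob that fingerprints are computed over.
    blob = base64.b64decode(parts[1], validate=True)
    md5_hex = hashlib.md5(blob).hexdigest()
    md5 = ":".join(md5_hex[i:i + 2] for i in range(0, len(md5_hex), 2))
    sha256 = base64.b64encode(hashlib.sha256(blob).digest()).decode().rstrip("=")
    return {"type": parts[0], "md5": md5, "sha256": "SHA256:" + sha256}
```

A key line that fails to parse here (too few fields or invalid base64 data) is malformed and is unlikely to be usable as a public key.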
Delete Keypair
You can delete unused Keypairs. However, a deleted Keypair cannot be recovered, so review the impact carefully before deleting it.
To delete a keypair, follow these steps.
- Click the All Services > Compute > GPU Server menu. Go to the Service Home page of GPU Server.
- On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
- On the Keypair List page, select the resource to delete, and click the Delete button.
- To delete several Keypairs at once, select multiple Keypair check boxes on the Keypair List page and click the Delete button at the top of the resource list.
- After deletion is complete, check the Keypair List page to confirm that the resource has been removed.
2.3 - Use Multi-instance GPU on GPU Server
After creating a GPU Server, you can enable the MIG (Multi-instance GPU) feature on the GPU Server’s VM (Guest OS) and create an instance for use.
NVIDIA Multi-instance GPU Introduction
NVIDIA Multi-instance GPU (hereafter referred to as MIG), supported starting with the NVIDIA Ampere architecture, securely partitions a GPU into separate GPU instances for running CUDA applications. This lets multiple users each work with their own GPU resources and improves overall GPU utilization. The feature is especially useful for workloads that do not fully use the GPU’s computing capacity, because users can run multiple workloads in parallel to maximize utilization.
Using Multi-instance GPU feature
To use the MIG feature, create an NVIDIA GPU Server on the Samsung Cloud Platform, then enable and disable MIG. The order of applying and removing MIG is as follows.
- MIG can be used on Samsung Cloud Platform’s next-generation GPU Server or MNGC (Multi-node GPU Cluster).
- For system requirements to use MIG, refer to the NVIDIA Multi-Instance GPU User Guide.
Applying and Using MIG
To use MIG, enable it, create instances, and assign jobs to them in the following order.
Activate MIG
Check the GPU status on the VM Instance (GuestOS) before applying MIG.
- Check whether MIG mode is Disabled.
$ nvidia-smi
Mon Sep 27 08:37:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Code block. nvidia-smi command - Check GPU disabled status (1)
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
Code block. nvidia-smi command - Check GPU disabled status (2)
Enable MIG for each GPU on the VM Instance (GuestOS), then reboot the VM Instance.
$ nvidia-smi -i 0 -mig 1
Enabled MIG mode for GPU 00000000:05:00.0
All done.
# reboot
Code block. nvidia-smi command - enable MIG
If the GPU is in use when you enable MIG, the following warning message may appear. If it does, check whether any programs are running on the GPU.
Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).
- Check the GPU status after applying MIG on the VM Instance (GuestOS).
- Check whether MIG mode is Enabled.
$ nvidia-smi
Mon Sep 27 09:44:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Code block. nvidia-smi command - Check GPU activation status (1)
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
Code block. nvidia-smi command - Check GPU activation status (2)
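For scripted checks, the MIG mode can also be read in machine-readable form instead of from the table. The following is a minimal sketch, assuming the `mig.mode.current` and `mig.mode.pending` query fields of `nvidia-smi --query-gpu` are available with your driver version. The function names are illustrative and are not part of any NVIDIA tool.

```python
import subprocess

# Assumption: these query fields are supported by the installed driver.
QUERY = ["nvidia-smi",
         "--query-gpu=index,mig.mode.current,mig.mode.pending",
         "--format=csv,noheader"]


def parse_mig_modes(csv_text):
    """Map GPU index -> (current, pending) MIG mode from the CSV output."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, current, pending = [field.strip() for field in line.split(",")]
        modes[int(index)] = (current, pending)
    return modes


def describe(current, pending):
    """Summarize the mode; a differing pending mode means a reboot is still needed."""
    if current == pending:
        return current.lower()
    return f"{current.lower()} (pending: {pending.lower()})"


def read_mig_modes():
    """Run nvidia-smi on the VM Instance and return the parsed modes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_mig_modes(result.stdout)
```

On a GPU Server, `read_mig_modes()` can be called before and after the reboot. A result such as `disabled (pending: enabled)` suggests that the enable command was accepted but has not yet taken effect.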
GPU Instance creation
If you have enabled MIG and verified its status, you can create a GPU Instance.
- Check the list of MIG GPU Instance profiles that can be created.
$ nvidia-smi mig -i [GPU ID] -lgip
Code block. nvidia-smi command - view MIG GPU Instance profile list
$ nvidia-smi mig -i 0 -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.10gb       19     7/7        9.50       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb+me    20     1/1        9.50       No     14     0     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.20gb       14     3/3       19.50       No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.40gb        9     2/2       39.50       No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.40gb        5     1/1       39.50       No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.80gb        0     1/1       79.25       No     98     0     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+
- After creating a MIG GPU Instance, check it.
Create GPU Instance
$ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]
Code block. nvidia-smi command - GPU Instance creation
$ nvidia-smi mig -i 0 -cgi 0
Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)
Code block. nvidia-smi command - Example of creating a GPU Instance
Check GPU Instance
$ nvidia-smi mig -i [GPU ID] -lgi
Code block. nvidia-smi command - check GPU Instance
$ nvidia-smi mig -i 0 -lgi
+--------------------------------------------------------+
| GPU instances:                                         |
| GPU   Name              Profile  Instance   Placement  |
|                           ID       ID       Start:Size |
|========================================================|
|   0  MIG 7g.80gb           0        0          0:8     |
+--------------------------------------------------------+
Code block. nvidia-smi command - example of checking GPU Instance
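The profile lookup and creation steps above can be combined in a small helper. The following is a minimal sketch, assuming the `-lgip` table has the column layout shown in this section. The regular expression and the function names are illustrative assumptions, and the table layout may differ between driver versions.

```python
import re

# Matches rows such as "|   0  MIG 1g.10gb   19   7/7   9.50 ..." in the -lgip table.
PROFILE_ROW = re.compile(
    r"\|\s*(?P<gpu>\d+)\s+MIG\s+(?P<name>\S+)\s+(?P<id>\d+)\s+"
    r"(?P<free>\d+)/(?P<total>\d+)\s+(?P<mem>[\d.]+)"
)


def parse_gpu_instance_profiles(table_text):
    """Return {profile name: details} parsed from `nvidia-smi mig -i N -lgip` output."""
    profiles = {}
    for match in PROFILE_ROW.finditer(table_text):
        profiles[match["name"]] = {
            "id": int(match["id"]),
            "free": int(match["free"]),
            "total": int(match["total"]),
            "memory_gib": float(match["mem"]),
        }
    return profiles


def create_gpu_instance_cmd(gpu_id, profiles, profile_name):
    """Build the -cgi command for a named profile, refusing profiles with no free slot."""
    profile = profiles[profile_name]
    if profile["free"] < 1:
        raise ValueError(f"no free instance slot for profile {profile_name}")
    return ["nvidia-smi", "mig", "-i", str(gpu_id), "-cgi", str(profile["id"])]
```

The returned list can be passed to `subprocess.run` on the VM Instance. Looking the ID up by name avoids hard-coding profile IDs, which differ between GPU models.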
Compute Instance creation
If you have created a GPU Instance, you can create a Compute Instance.
Check the MIG Compute Instance profiles you can create.
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip
Code block. nvidia-smi command - Check MIG Compute Instance profile
$ nvidia-smi mig -i 0 -gi 0 -lcip
+--------------------------------------------------------------------------------------+
| Compute instance profiles:                                                           |
| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |
|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |
|         ID                                                          CE    JPEG       |
|======================================================================================|
|   0      0       MIG 1c.7g.80gb       0      7/5           14        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 2c.7g.80gb       1      3/3           28        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 3c.7g.80gb       2      2/2           42        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 4c.7g.80gb       3      1/1           56        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
|   0      0       MIG 7g.80gb          4*     1/1           98        5     0     1   |
|                                                                      7     1         |
+--------------------------------------------------------------------------------------+
Code block. Example of MIG Compute Instance profile list
Create and verify a MIG Compute Instance.
- MIG Compute Instance creation
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]
Code block. nvidia-smi command - Create MIG Compute Instance
$ nvidia-smi mig -i 0 -gi 0 -cci 4
Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb (ID 4)
Code block. nvidia-smi command - Example of creating a MIG Compute Instance
- Check MIG Compute Instance
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci
Code block. nvidia-smi command - check MIG Compute Instance
$ nvidia-smi mig -i 0 -gi 0 -lci
+-----------------------------------------------------------------+
| Compute instance profiles:                                      |
| GPU     GPU       Name           Profile  Instances  Placement  |
|       Instance                     ID        ID      Start:Size |
|         ID                                                      |
|=================================================================|
|   0      0       MIG 7g.80gb        4         0        0:7      |
+-----------------------------------------------------------------+
Code block. Example of checking MIG Compute Instance
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
  MIG 7g.80gb Device 0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
Code block. nvidia-smi command - Check GPU status (1)
$ nvidia-smi
Mon Sep 27 09:52:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   32C    P0    49W / 400W |      0MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |      0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
|                  |      1MiB / 13107... |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Code block. nvidia-smi command - Check GPU status (2)
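When jobs are assigned by script, the MIG devices reported by `nvidia-smi -L` need to be turned into device identifiers. The following is a minimal sketch that parses output of the form shown above and produces the `[GPU ID]:[MIG ID]` form used by the docker example in this document. The function name and the returned dictionary keys are illustrative assumptions.

```python
import re

GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (GPU-[0-9a-f-]+)\)")
MIG_LINE = re.compile(r"^\s*MIG (\S+)\s+Device\s+(\d+): \(UUID: (MIG-[0-9a-f-]+)\)")


def list_mig_devices(listing):
    """Parse `nvidia-smi -L` output into one entry per MIG device."""
    devices = []
    gpu_index = None
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            # Remember which physical GPU the following MIG lines belong to.
            gpu_index = int(gpu.group(1))
            continue
        mig = MIG_LINE.match(line)
        if mig and gpu_index is not None:
            devices.append({
                "spec": f"{gpu_index}:{mig.group(2)}",  # [GPU ID]:[MIG ID]
                "profile": mig.group(1),
                "uuid": mig.group(3),
            })
    return devices
```

An empty result after MIG has been enabled usually means that no Compute Instance has been created yet.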
Using MIG
- Use the MIG Instance to run a job.
- Example of task execution
$ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi
Code block. Task execution example
- The following shows an example of the work performed.
$ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh
================
== TensorFlow ==
================
NVIDIA Release 21.08-tf1 (build 26012104)
TensorFlow Version 1.15.5
Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
...
# Run Python process
root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100
...
PY 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0]
TF 1.15.5
...
Code block. Operation result
- Check the GPU usage. (Create JOB process)
- When the job runs, you can see that a process is allocated to the MIG device and its memory usage increases.
$ nvidia-smi
Code block. nvidia-smi command - check GPU utilization
- You can check the GPU usage as shown below.
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |  66562MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
|                  |      5MiB / 13107... |           |                       |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0      17483      C   python                          66559MiB |
+-----------------------------------------------------------------------------+
Code block. Example of checking GPU utilization.
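To track usage over time, the figures in the tables above can be extracted by script. The following is a minimal sketch, assuming the row layouts shown in this section. The regular expressions and function names are illustrative assumptions and may need adjusting for other driver versions.

```python
import re

USAGE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
# GPU, GI, CI, PID, type, process name, memory (as in the Processes table).
PROCESS_ROW = re.compile(
    r"\|\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([CG+]+)\s+(\S+)\s+(\d+)MiB\s+\|"
)


def memory_usage_percent(device_row):
    """Return used memory as a percentage of the total for one MIG device row."""
    match = USAGE.search(device_row)
    if match is None:
        raise ValueError("no 'used MiB / total MiB' pair found")
    used, total = int(match.group(1)), int(match.group(2))
    return round(100.0 * used / total, 1)


def parse_processes(table_text):
    """Return one dict per process row found in the Processes table."""
    keys = ("gpu", "gi", "ci", "pid")
    processes = []
    for match in PROCESS_ROW.finditer(table_text):
        entry = {key: int(value) for key, value in zip(keys, match.groups()[:4])}
        entry.update(type=match.group(5), name=match.group(6),
                     memory_mib=int(match.group(7)))
        processes.append(entry)
    return processes
```

For the example output above, the MIG device row corresponds to roughly 82% of the instance memory, almost all of it held by the single python process.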
Delete and release MIG Instance
Follow these steps to delete the MIG instances and disable MIG.
Delete Compute Instance
- Delete the Compute Instance.
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance] -dci
Code block. nvidia-smi command - Delete Compute Instance
$ nvidia-smi mig -i 0 -gi 0 -lci
+-----------------------------------------------------------------+
| Compute instance profiles:                                      |
| GPU     GPU       Name           Profile  Instances  Placement  |
|       Instance                     ID        ID      Start:Size |
|         ID                                                      |
|=================================================================|
|   0      0       MIG 7g.80gb        4         0        0:7      |
+-----------------------------------------------------------------+
Code block. Example of checking MIG Compute Instance
$ nvidia-smi mig -i 0 -gi 0 -dci
Successfully destroyed compute instance ID 0 from GPU instance ID 0
Code block. Compute Instance deletion example
$ nvidia-smi mig -i 0 -gi 0 -lci
No compute instances found: Not found
Code block. Confirm Compute Instance deletion
Delete GPU Instance
- Delete the GPU Instance.
$ nvidia-smi mig -i [GPU ID] -dgi
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi
Code block. nvidia-smi command - Delete GPU Instance
$ nvidia-smi mig -i 0 -lgi
+--------------------------------------------------------+
| GPU instances:                                         |
| GPU   Name              Profile  Instance   Placement  |
|                           ID       ID       Start:Size |
|========================================================|
|   0  MIG 7g.80gb           0        0          0:8     |
+--------------------------------------------------------+
Code block. nvidia-smi command - Example of checking GPU Instance
$ nvidia-smi mig -i 0 -dgi
Successfully destroyed GPU instance ID 0 from GPU 0
Code block. nvidia-smi command - GPU Instance deletion example
$ nvidia-smi mig -i 0 -lgi
No GPU instances found: Not found
Code block. nvidia-smi command - Confirm GPU Instance deletion
Disable MIG
- Disable MIG, then reboot the VM Instance.
$ nvidia-smi -mig 0
Disabled MIG Mode for GPU 00000000:05:00.0
All done.
Code block. nvidia-smi command - disable MIG
$ nvidia-smi
Mon Sep 30 05:18:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   33C    P0    60W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Code block. nvidia-smi command - check GPU status
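The release procedure is order-sensitive: Compute Instances are removed first, then GPU Instances, and only then is MIG disabled. The following is a minimal sketch of that ordering, using only the commands shown in this section. The function names are illustrative assumptions, and the reboot after disabling MIG is left to the operator.

```python
import subprocess


def teardown_commands(gpu_id, gpu_instance_id):
    """Return the release commands in the order they must be run."""
    gpu, gi = str(gpu_id), str(gpu_instance_id)
    return [
        # 1. Delete the Compute Instances inside the GPU Instance.
        ["nvidia-smi", "mig", "-i", gpu, "-gi", gi, "-dci"],
        # 2. Delete the GPU Instance itself.
        ["nvidia-smi", "mig", "-i", gpu, "-gi", gi, "-dgi"],
        # 3. Disable MIG on the GPU (takes effect after the reboot).
        ["nvidia-smi", "-i", gpu, "-mig", "0"],
    ]


def run_teardown(gpu_id, gpu_instance_id):
    """Run each step in order and stop at the first failing command."""
    for command in teardown_commands(gpu_id, gpu_instance_id):
        subprocess.run(command, check=True)
```

Because `check=True` raises on a non-zero exit code, a failed deletion (for example, while a job is still running on the instance) stops the sequence before MIG is disabled.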
2.4 - Use NVSwitch on GPU Server
After creating a GPU Server, you can enable the NVSwitch feature on the GPU Server’s VM (Guest OS) and use fast GPU-to-GPU P2P communication.
Exploring NVIDIA NVSwitch for Multi GPU
NVLink is a direct, bidirectional GPU-to-GPU interconnect that expands I/O between multiple GPUs within a server. Using NVSwitch, you can connect all GPUs in a server to one another with full NVLink bandwidth.
Checking NVSwitch operation
Check the NVIDIA Fabric Manager, NVIDIA NVLink topology, and NVIDIA NVLink Status on the GPU server.
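The first two checks can also be automated. The following is a minimal sketch that inspects text captured from `systemctl status nvidia-fabricmanager` and `nvidia-smi topo -m`, assuming the output formats shown in the subsections below. The function names are illustrative assumptions.

```python
import re

GPU_NAME = re.compile(r"GPU\d+")


def fabric_manager_running(status_text):
    """True if the systemctl status output reports 'active (running)'."""
    pattern = r"^\s*Active:\s+active \(running\)"
    return re.search(pattern, status_text, re.MULTILINE) is not None


def non_nvlink_pairs(topo_text):
    """Return (gpu, peer, link) for every GPU pair not connected through NV# links."""
    gpus, problems = [], []
    for line in topo_text.splitlines():
        cells = line.split()
        if not cells or not GPU_NAME.fullmatch(cells[0]):
            continue
        if not gpus:
            # The first matching line is the header row listing the GPU columns.
            gpus = [cell for cell in cells if GPU_NAME.fullmatch(cell)]
            continue
        for peer, link in zip(gpus, cells[1:1 + len(gpus)]):
            if link != "X" and not link.startswith("NV"):
                problems.append((cells[0], peer, link))
    return problems
```

An empty list from `non_nvlink_pairs` means that every GPU pair in the matrix is reported as NVLink-connected, as in the NV12 example in this section.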
NVIDIA Fabric Manager operating status
Verify that active (running) is displayed, which indicates that the service is operating normally.
~$ systemctl status nvidia-fabricmanager
nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2026-02-02 16:23:27 KST; 32min ago
   Main PID: 2191 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 18.0M
        CPU: 33.461s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2191 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Check NVIDIA NVLink topology
Check the NVIDIA NVLink topology.
~$ nvidia-smi topo -m
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 0-127 0-7 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 0-127 0-7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 0-127 0-7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 0-127 0-7 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 0-127 0-7 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
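The topology matrix above can also be checked programmatically. The following is a minimal sketch, not an official tool: it assumes the text of `nvidia-smi topo -m` has already been captured as a string, and the helper names `parse_topo_matrix` and `non_nvlink_pairs` as well as the shortened three-GPU `SAMPLE` are hypothetical and used only for illustration.

```python
# Shortened, illustrative excerpt in the same layout as the output above.
SAMPLE = """\
GPU0 GPU1 GPU2 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 0-127 0-7 N/A
GPU1 NV12 X NV12 0-127 0-7 N/A
GPU2 NV12 NV12 X 0-127 0-7 N/A
Legend:
X = Self
NV# = Connection traversing a bonded set of # NVLinks
"""


def parse_topo_matrix(text):
    """Return {(row_gpu, col_gpu): link_type} from `nvidia-smi topo -m` text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The header row is the first line whose first token is "GPU0".
    header_idx = next(i for i, ln in enumerate(lines) if ln.split()[0] == "GPU0")
    # Keep only column names of the form GPU<number>; this skips the
    # "GPU" token that belongs to the "GPU NUMA ID" column title.
    gpus = [t for t in lines[header_idx].split()
            if t.startswith("GPU") and t[3:].isdigit()]
    links = {}
    for ln in lines[header_idx + 1:]:
        tokens = ln.split()
        if tokens[0] not in gpus:
            continue  # legend or other non-matrix line
        for col, cell in zip(gpus, tokens[1:1 + len(gpus)]):
            links[(tokens[0], col)] = cell
    return links


def non_nvlink_pairs(links):
    """List GPU pairs whose connection type does not start with 'NV'."""
    return sorted(pair for pair, cell in links.items()
                  if pair[0] != pair[1] and not cell.startswith("NV"))


if __name__ == "__main__":
    print(non_nvlink_pairs(parse_topo_matrix(SAMPLE)))
```

An empty list means every GPU pair in the matrix is reported as an NVLink connection. Any pair listed instead uses a PCIe path such as SYS or PHB.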
Check NVIDIA NVLink Status
Check the NVIDIA NVLink status.
~$ nvidia-smi nvlink -s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
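As a rough cross-check of the per-link figures above, the link speeds can be summed per GPU. This is a sketch under stated assumptions: the status text has been captured as a string, the function name `nvlink_bandwidth_per_gpu` is hypothetical, and `SAMPLE` reproduces only the first GPU block from the output above. Lines that do not report a GB/s figure are ignored.

```python
import re

# One GPU block in the same layout as the NVLink status output above.
SAMPLE = (
    "GPU 1: NVIDIA A100-SXM4-80GB "
    "(UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)\n"
    + "".join(f"\t Link {i}: 25 GB/s\n" for i in range(12))
)


def nvlink_bandwidth_per_gpu(text):
    """Sum the reported per-link speed (GB/s) for each GPU index."""
    totals = {}
    current = None
    for line in text.splitlines():
        gpu = re.match(r"\s*GPU (\d+):", line)
        if gpu:
            current = int(gpu.group(1))
            totals[current] = 0.0
            continue
        link = re.match(r"\s*Link (\d+): ([\d.]+) GB/s", line)
        if link and current is not None:
            totals[current] += float(link.group(2))
    return totals


if __name__ == "__main__":
    for gpu, total in nvlink_bandwidth_per_gpu(SAMPLE).items():
        print(f"GPU {gpu}: {total:.0f} GB/s across all reported links")
```

With twelve links at 25 GB/s each, as in the output above, the total per GPU is 300 GB/s.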
2.5 - Install ServiceWatch Agent
Users can install the ServiceWatch Agent on a GPU server to collect custom metrics and logs.
ServiceWatch Agent
Two main types of agent must be installed on a GPU server to collect custom metrics and logs for ServiceWatch: a Prometheus Exporter and the Open Telemetry Collector.
| Category | Detailed description | |
|---|---|---|
| Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape. | |
| Open Telemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch). | |
If Kubernetes Engine is configured on the GPU server, check GPU metrics through the metrics that Kubernetes Engine provides.
- The DCGM Exporter may not operate correctly if it is installed on a GPU server configured with Kubernetes Engine.
Pre-configuration for Using ServiceWatch Agent
To utilize the ServiceWatch Agent, please refer to Prerequisite Settings for ServiceWatch Agent and prepare the prerequisite settings.
Install Prometheus Exporter for GPU metrics (for Ubuntu)
Install the Prometheus Exporter for collecting metrics from the GPU server according to the steps below.
Verify NVIDIA Driver Installation
- Check the installed NVIDIA Driver.
nvidia-smi --query-gpu driver_version --format csv
Code block. NVIDIA Driver version check command
driver_version
535.183.06
...
535.183.06
Code block. NVIDIA Driver version check example
NVSwitch Configuration and Query (NSCQ) Library Installation
Install cuda-keyring.
wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
Code block. NSCQ library download command
sudo dpkg -i cuda-keyring_1.1-1_all.deb
apt update
Code block. NSCQ library installation command
nvidia-smi --query-gpu driver_version --format csv
Code block. NVIDIA Driver version check command
driver_version
535.183.06
...
535.183.06
Code block. Example of checking NVIDIA Driver version
Install libnvidia-nscq.
apt-cache policy libnvidia-nscq-535
Code block. NSCQ library apt-cache command
libnvidia-nscq-535:
  Installed: (none)
  Candidate: 535.247.01-1
  Version table:
     535.247.01-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     ...
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     535.216.01-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     535.183.06-1 600 # Install the version that matches the driver
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     535.183.01-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     535.54.03-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Code block. NSCQ library apt-cache command result
apt install libnvidia-nscq-535=535.183.06-1
Code block. NSCQ library installation command
The library version must match the installed NVIDIA Driver version.
- Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
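The version-matching rule can be sketched in a few lines. This is an illustration only: the helper names `driver_versions` and `nscq_package_spec` are hypothetical, the input is assumed to be the CSV text printed by the driver version check, and the `-1` package revision is an assumption taken from the example above rather than a guaranteed value.

```python
# Illustrative CSV text in the same layout as the driver version check output.
SAMPLE_CSV = "driver_version\n535.183.06\n535.183.06\n"


def driver_versions(csv_text):
    """Return the set of driver versions reported, skipping the CSV header."""
    rows = [ln.strip() for ln in csv_text.splitlines() if ln.strip()]
    return {row for row in rows if row != "driver_version"}


def nscq_package_spec(driver_version, revision="1"):
    """Build the apt package spec that pins NSCQ to the driver version.

    The revision suffix is an assumption based on the example in this guide.
    """
    branch = driver_version.split(".")[0]  # e.g. "535"
    return f"libnvidia-nscq-{branch}={driver_version}-{revision}"


if __name__ == "__main__":
    versions = driver_versions(SAMPLE_CSV)
    if len(versions) != 1:
        raise SystemExit(f"expected one driver version, found: {sorted(versions)}")
    print(nscq_package_spec(versions.pop()))
```

The printed spec can be compared against the version table from `apt-cache policy` before running the install command.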
NVSwitch Device Monitoring API (NVSDM) Library Installation
- Install the NVSDM library.
apt-cache policy libnvsdm
Code block. NVSDM library apt-cache command
libnvsdm:
  Installed: (none)
  Candidate: 580.105.08-1
  Version table:
     580.105.08-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.95.05-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.82.07-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.65.06-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Code block. NVSDM library apt-cache command result
apt install libnvsdm=580.105.08-1
Code block. Install NVSDM library
Install NVIDIA DCGM (for Ubuntu)
Install the DCGM Exporter according to the steps below.
- DCGM(datacenter-gpu-manager) Installation
- datacenter-gpu-manager-exporter installation
- DCGM Service Activation and Start
DCGM(datacenter-gpu-manager) Installation
datacenter-gpu-manager-4-cuda12 is a specific version of NVIDIA’s Data Center GPU Manager (DCGM) package, which manages and monitors NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. The tool provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.
- Check the CUDA version.
nvidia-smi | grep CUDA
Code block. Check CUDA version
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
Code block. Example of CUDA version check result
CUDA_VERSION=12
Code block. CUDA version setting command
- Install datacenter-gpu-manager-cuda.
apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Code block. datacenter-gpu-manager-cuda installation command
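The step from the `nvidia-smi` banner to the package name can be sketched as follows. The function name `dcgm_package_name` is hypothetical, and the sketch assumes the banner line has been captured as text. It only mirrors the naming pattern shown in this guide and does not query any repository.

```python
import re

# Banner line in the same form as the CUDA version check result above.
SAMPLE_BANNER = (
    "| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |"
)


def dcgm_package_name(banner):
    """Derive the DCGM 4.x package name from the CUDA major version."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", banner)
    if match is None:
        raise ValueError("no 'CUDA Version: X.Y' field found in the banner")
    cuda_major = match.group(1)  # "12" for CUDA 12.2
    return f"datacenter-gpu-manager-4-cuda{cuda_major}"


if __name__ == "__main__":
    print(dcgm_package_name(SAMPLE_BANNER))
```

Only the major version is used, which corresponds to setting `CUDA_VERSION=12` for a reported CUDA version of 12.2.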
Install datacenter-gpu-manager-exporter
datacenter-gpu-manager-exporter is a tool, based on NVIDIA Data Center GPU Manager (DCGM), that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.
- Install datacenter-gpu-manager-exporter.
apt install datacenter-gpu-manager-exporter
Code block. datacenter-gpu-manager-exporter installation command
- Check the DCGM Exporter configuration file.
cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
Code block. Command to check the datacenter-gpu-manager-exporter configuration file
ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
Code block. Example of verification result for the datacenter-gpu-manager-exporter configuration file
- Review the configuration provided with the DCGM Exporter installation: remove the leading # from the metrics you need, and add # in front of the metrics you do not need.
vi /etc/dcgm-exporter/default-counters.csv
## Example ##
...
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
...
Code block. Example of datacenter-gpu-manager-exporter metric configuration
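The comment toggling described above can also be scripted instead of edited by hand. The following is a minimal sketch that works on the file contents as a string. The function name `toggle_counters` is hypothetical, and writing the result back to /etc/dcgm-exporter/default-counters.csv is left to the reader.

```python
# Two lines in the same format as the counters file example above.
SAMPLE = (
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, "
    "Ratio of cycles the tensor (HMMA) pipe is active.\n"
    "# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, "
    "Ratio of cycles the fp64 pipes are active.\n"
)


def toggle_counters(csv_text, enable=(), disable=()):
    """Uncomment the fields in `enable` and comment out those in `disable`."""
    enable, disable = set(enable), set(disable)
    out = []
    for line in csv_text.splitlines():
        body = line.lstrip("# \t")             # line without a leading comment marker
        field = body.split(",", 1)[0].strip()  # first CSV column, e.g. DCGM_FI_...
        if field in enable:
            out.append(body)
        elif field in disable:
            out.append("# " + body)
        else:
            out.append(line)                   # leave every other line untouched
    return "\n".join(out) + ("\n" if csv_text.endswith("\n") else "")


if __name__ == "__main__":
    print(toggle_counters(
        SAMPLE,
        enable={"DCGM_FI_PROF_PIPE_FP64_ACTIVE"},
        disable={"DCGM_FI_PROF_PIPE_TENSOR_ACTIVE"},
    ), end="")
```

Applying the same call twice yields the same text, so the sketch is safe to re-run on an already adjusted file.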
Enable and start DCGM service
Enable and start the nvidia-dcgm service.
systemctl enable --now nvidia-dcgm
Code block. nvidia-dcgm service activation and start command
Enable and start the nvidia-dcgm-exporter service.
systemctl enable --now nvidia-dcgm-exporter
Code block. nvidia-dcgm-exporter service activation and start command
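Once both services are running, one way to confirm that metrics are exposed is to read the exporter's Prometheus text output. The sketch below makes several assumptions that are not stated in this guide: the exporter is assumed to listen on `http://localhost:9400/metrics`, the label name `gpu` and the sample lines are illustrative, and the helper names `fetch_metrics` and `gauge_by_gpu` are hypothetical.

```python
import re
from urllib.request import urlopen

# Illustrative exposition text; real output carries more labels and metrics.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",device="nvidia0"} 34
DCGM_FI_DEV_GPU_TEMP{gpu="1",device="nvidia1"} 36
DCGM_FI_DEV_SM_CLOCK{gpu="0",device="nvidia0"} 1410
"""

_SAMPLE_LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5.0):
    """Download the exporter output. The URL is an assumed default."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def gauge_by_gpu(metrics_text, metric):
    """Return {gpu_label: value} for one metric in Prometheus text format."""
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # HELP/TYPE comment lines
        match = _SAMPLE_LINE.match(line.strip())
        if not match or match.group(1) != metric:
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match.group(2)))
        if "gpu" in labels:
            values[labels["gpu"]] = float(match.group(3))
    return values


if __name__ == "__main__":
    # Replace SAMPLE with fetch_metrics() on a server running the exporter.
    print(gauge_by_gpu(SAMPLE, "DCGM_FI_DEV_GPU_TEMP"))
```

An empty result for a metric that is enabled in the counters file suggests that the exporter is not reachable at the assumed address or that the metric is still commented out.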
For more details, refer to ServiceWatch > Using ServiceWatch Agent.
Install Prometheus Exporter for GPU metrics (for RHEL)
Install the ServiceWatch Agent to collect metrics from the GPU server according to the steps below.
NVIDIA Driver installation verification (for RHEL)
- Check the installed NVIDIA Driver.
nvidia-smi --query-gpu driver_version --format csv
Code block. NVIDIA Driver version check command
driver_version
535.183.06
...
535.183.06
Code block. Example of checking NVIDIA Driver version
NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)
The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.
- For RHEL, verify that libnvidia-nscq is installed and install it if necessary.
Check the libnvidia-nscq package.
rpm -qa | grep libnvidia-nscq
libnvidia-nscq-535-535.183.06-1.x86_64
Code block. Check NSCQ library package
Add the CUDA Repository to DNF.
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Code block. Add DNF Repository
Reset the NVIDIA Driver module state.
dnf module reset nvidia-driver
Code block. Initialize the state of the NVIDIA Driver DNF module
Updating Subscription Management repositories.
Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
Dependencies resolved.
=============================================
Package Architecture Version Repository Size
=============================================
Disabling module profiles:
nvidia-driver/default
nvidia-driver/fm
Resetting modules:
nvidia-driver
Transaction Summary
=============================================
Is this ok [y/N]: y
Code block. Example of the status initialization result of the NVIDIA Driver DNF module
Enable the NVIDIA Driver module.
dnf module enable nvidia-driver:535-open
Code block. NVIDIA Driver module activation
Updating Subscription Management repositories.
Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
Dependencies resolved.
=============================================
Package Architecture Version Repository Size
=============================================
Enabling module streams:
nvidia-driver 535-open
Transaction Summary
=============================================
Is this ok [y/N]: y
Code block. Example of NVIDIA Driver module activation result
Check the libnvidia-nscq module list.
dnf list libnvidia-nscq-535 --showduplicates
Code block. Check libnvidia-nscq module list
Install libnvidia-nscq.
dnf install libnvidia-nscq-535-535.183.06-1
Code block. libnvidia-nscq installation command
NVSwitch Device Monitoring API(NVSDM) Library Installation (for RHEL)
Check the NVSDM library module list.
dnf list libnvsdm --showduplicates
Code block. Check NVSDM library module list
libnvsdm:
  Installed: (none)
  Candidate: 580.105.08-1
  Version table:
     580.105.08-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.95.05-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.82.07-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
     580.65.06-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 Packages
Code block. Example of NVSDM library module list verification results
Install libnvsdm.
dnf install libnvsdm-580.105.08-1
Code block. Install NVSDM library
Updating Subscription Management repositories.
Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
Dependencies resolved.
=========================================================================
Package Architecture Version Repository Size
=========================================================================
Installing:
libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
Installing dependencies:
infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
Transaction Summary
=========================================================================
Install 3 Packages
Total download size: 1.0 M
Installed size: 3.2 M
Is this ok [y/N]: y
Code block. Example of the NVSDM library installation command result
Install NVIDIA DCGM (for RHEL)
Install the DCGM Exporter according to the steps below.
- DCGM(datacenter-gpu-manager) Installation
- datacenter-gpu-manager-exporter installation
- DCGM Service Activation and Start
DCGM(datacenter-gpu-manager) Installation (for RHEL)
datacenter-gpu-manager-4-cuda12 is a specific version of NVIDIA’s Data Center GPU Manager (DCGM) package, which manages and monitors NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. The tool provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.
- Add the CUDA repository to DNF.
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
Code block. Add DNF Repository
- Check the CUDA version.
nvidia-smi | grep CUDA
Code block. Check CUDA version
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
Code block. Example of CUDA version check result
CUDA_VERSION=12
Code block. CUDA version setting command
- Check the list of datacenter-gpu-manager-cuda modules.
dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
Code block. Check the datacenter-gpu-manager-cuda module list
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Available Packages
datacenter-gpu-manager-4-cuda12.x86_64 1:4.0.0-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.1.0-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.1.1-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.2.0-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.2.2-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.2.3-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.2.3-2 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.3.0-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.3.1-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.4.0-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.4.1-1 cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64 1:4.4.2-1 cuda-rhel8-x86_64
Code block. Example of checking the list of datacenter-gpu-manager-cuda modules
- Install datacenter-gpu-manager-cuda.
dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
Code block. Install datacenter-gpu-manager-cuda
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Dependencies resolved.
===================================================================================================
Package Architecture Version Repository Size
===================================================================================================
Installing:
datacenter-gpu-manager-4-cuda12 x86_64 1:4.4.2-1 cuda-rhel8-x86_64 554 M
Installing dependencies:
datacenter-gpu-manager-4-core x86_64 1:4.4.2-1 cuda-rhel8-x86_64 9.9 M
Installing weak dependencies:
datacenter-gpu-manager-4-proprietary x86_64 1:4.4.2-1 cuda-rhel8-x86_64 5.3 M
datacenter-gpu-manager-4-proprietary-cuda12 x86_64 1:4.4.2-1 cuda-rhel8-x86_64 289 M
Transaction Summary
===================================================================================================
Install 4 Packages
...
Is this ok [y/N]: y
Code block. Example of datacenter-gpu-manager-cuda installation result
datacenter-gpu-manager-exporter installation (for RHEL)
It is a tool that, based on NVIDIA Data Center GPU Manager (DCGM), collects various GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them for use in monitoring systems like Prometheus.
Add the CUDA Repository to DNF. 1. (If you have already performed this command, proceed to the next step.)
Color modednf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repodnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repoCode block. Add DNF Repository Check the CUDA version. 2. (If you have already performed this command, proceed to the next step.)
Color modenvidia-smi | grep CUDAnvidia-smi | grep CUDACode block. Check CUDA version Color mode| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 || NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |Code block. Example of CUDA version check result Color modeCUDA_VERSION=12CUDA_VERSION=12Code block. CUDA version setting command Check the datacenter-gpu-manager-exporter module list.
Color modednf list datacenter-gpu-manager-exporter --showduplicatesdnf list datacenter-gpu-manager-exporter --showduplicatesCode block. Check the list of datacenter-gpu-manager-exporter modules Color modeUpdating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST. Available Packages datacenter-gpu-manager-exporter.x86_64 4.0.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.0-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.3-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.0-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.2-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.6.0-1 cuda-rhel8-x86_64Updating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST. Available Packages datacenter-gpu-manager-exporter.x86_64 4.0.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.0-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.1.3-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.0-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.1-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.5.2-1 cuda-rhel8-x86_64 datacenter-gpu-manager-exporter.x86_64 4.6.0-1 cuda-rhel8-x86_64code block. Example of checking the module list of datacenter-gpu-manager-exporter Install datacenter-gpu-manager-cuda. 
dcgm-exporter 4.5.X requires glibc 2.34 or newer, but because RHEL 9 provides glibc 2.34, we install it by specifying version 4.1.3-1.
Color modednf install datacenter-gpu-manager-exporter-4.1.3-1dnf install datacenter-gpu-manager-exporter-4.1.3-1Code block. Install datacenter-gpu-manager-cuda Color modeUpdating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST. Dependencies resolved. ==================================================================================================== Package Architecture Version Repository Size ==================================================================================================== Installing: datacenter-gpu-manager-exporter x86_64 4.1.3-1 cuda-rhel8-x86_64 26 M Is this ok [y/N]: yUpdating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST. Dependencies resolved. ==================================================================================================== Package Architecture Version Repository Size ==================================================================================================== Installing: datacenter-gpu-manager-exporter x86_64 4.1.3-1 cuda-rhel8-x86_64 26 M Is this ok [y/N]: yCode block. Example of datacenter-gpu-manager-cuda installation result Color modecat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStartcat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStartCode block. datacenter-gpu-manager-exporter configuration file Color modeExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csvExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csvCode block. Example of the result of checking the datacenter-gpu-manager-exporter configuration file. 
When installing the DCGM Exporter, review the provided configuration: remove the leading `#` from metrics you need, and add `#` to metrics you do not need.

```
vi /etc/dcgm-exporter/default-counters.csv

## Example ##
...
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
...
```
Code block. Example of datacenter-gpu-manager-exporter metric configuration
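The comment toggling described above can also be scripted. The following is a minimal sketch, assuming GNU sed (as shipped with RHEL); `enable_metric` and `disable_metric` are hypothetical helper names, and the script edits a temporary sample file rather than the real `/etc/dcgm-exporter/default-counters.csv`.

```shell
#!/usr/bin/env bash
# Sketch: toggle metrics in a counters file by removing or adding the leading '#'.
# enable_metric / disable_metric are hypothetical helpers, not part of DCGM.

enable_metric() {   # usage: enable_metric FIELD FILE
    # Strip a leading '#' (and any spaces after it) from the line for FIELD.
    sed -i -E "s/^#[[:space:]]*($1,)/\1/" "$2"
}

disable_metric() {  # usage: disable_metric FIELD FILE
    # Prefix the line for FIELD with '# ' if it is currently uncommented.
    sed -i -E "s/^($1,)/# \1/" "$2"
}

# Work on a temporary sample so the real configuration is untouched.
counters="$(mktemp)"
cat > "$counters" <<'EOF'
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
EOF

enable_metric DCGM_FI_PROF_PIPE_FP64_ACTIVE "$counters"
disable_metric DCGM_FI_PROF_PIPE_TENSOR_ACTIVE "$counters"
cat "$counters"
```

To apply the same helpers to the real file, pass `/etc/dcgm-exporter/default-counters.csv` as the second argument, ideally after making a backup copy.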
Enable and start DCGM service (for RHEL)
Enable and start the nvidia-dcgm service.

```
systemctl enable --now nvidia-dcgm
```
Code block. nvidia-dcgm service activation and start command

Enable and start the nvidia-dcgm-exporter service.

```
systemctl enable --now nvidia-dcgm-exporter
```
Code block. nvidia-dcgm-exporter service activation and start command
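After enabling both services, it can be useful to confirm that they are actually running. The following is a minimal sketch that relies only on `systemctl is-active`; `service_state` is a hypothetical helper name introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether each DCGM-related unit is active.
# service_state is a hypothetical helper, not part of DCGM or systemd.

service_state() {   # usage: service_state UNIT
    # 'systemctl is-active --quiet' exits 0 only when the unit is active.
    if systemctl is-active --quiet "$1" 2>/dev/null; then
        echo "$1: active"
    else
        echo "$1: not active"
    fi
}

for unit in nvidia-dcgm nvidia-dcgm-exporter; do
    service_state "$unit"
done
```

If a unit is reported as not active, `systemctl status` for that unit shows the reason.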
For more details, refer to ServiceWatch > Using ServiceWatch Agent.
DCGM Exporter metric
DCGM Exporter Key Metrics
Among the metrics provided by the DCGM Exporter, the main GPU metrics are as follows.
| Category | DCGM Field | Prometheus Metric Type | Summary |
|---|---|---|---|
| Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz) |
| Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C) |
| Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W) |
| Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %) |
| Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %) |
| Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB) |
| Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB) |
| NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes |
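As an illustration of how these gauges can be combined, the sketch below derives a per-GPU frame buffer usage percentage from `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE`. It assumes the exporter output is in Prometheus text format with a `gpu="N"` label; the sample values are made up, the label set is simplified, and `fb_usage_percent` is a hypothetical helper name. It treats used + free as the total, which ignores any reserved memory.

```shell
#!/usr/bin/env bash
# Sketch: compute frame buffer usage per GPU from exporter-style text.
# fb_usage_percent is a hypothetical helper; the sample values are illustrative only.

fb_usage_percent() {
    # Reads Prometheus text format on stdin; prints "gpu=<id> <percent>" per GPU.
    awk '
        /^DCGM_FI_DEV_FB_(USED|FREE)[{]/ {
            # Extract the numeric id from the gpu="N" label.
            match($0, /gpu="[0-9]+"/)
            id = substr($0, RSTART + 5, RLENGTH - 6)
            # The sample value is the last whitespace-separated token.
            if ($0 ~ /^DCGM_FI_DEV_FB_USED/) used[id] = $NF
            else free[id] = $NF
        }
        END {
            for (id in used)
                printf "gpu=%s %.1f\n", id, 100 * used[id] / (used[id] + free[id])
        }
    ' | sort
}

# Made-up sample in MiB, shaped like exporter output with a reduced label set.
fb_usage_percent <<'EOF'
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0"} 20000
DCGM_FI_DEV_FB_USED{gpu="1"} 60000
# TYPE DCGM_FI_DEV_FB_FREE gauge
DCGM_FI_DEV_FB_FREE{gpu="0"} 60000
DCGM_FI_DEV_FB_FREE{gpu="1"} 20000
EOF
```

In practice this kind of calculation is usually done in the monitoring backend; the script only shows how the two metrics relate.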
DCGM Exporter metric collection configuration
For the default metrics of DCGM Exporter, refer to DCGM Exporter > Default Metrics.
- To collect additional metrics beyond the default configuration, remove the leading `#` from those metrics in default-counters.csv.
- For default metrics you do not wish to collect, add `#` to the corresponding entry or delete it.
```
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES, counter, Total number of bytes received through PCIe RX via NVML.
...
```
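To see which metrics a counters file currently enables, the format above (comment lines start with `#`, fields are comma-separated) can be filtered with awk. This is a minimal sketch; `list_enabled_metrics` is a hypothetical helper name, and the sample file is a shortened copy of the configuration shown above.

```shell
#!/usr/bin/env bash
# Sketch: list uncommented metrics and their Prometheus type from a counters file.
# list_enabled_metrics is a hypothetical helper, not part of DCGM.

list_enabled_metrics() {   # usage: list_enabled_metrics FILE
    # Skip comment lines; keep lines with at least three comma-separated fields,
    # which also drops blank lines and "..." placeholders.
    awk -F', *' '!/^ *#/ && NF >= 3 { print $1, $2 }' "$1"
}

# Shortened sample based on the configuration format shown above.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
EOF

list_enabled_metrics "$sample"
```

Running the helper against `/etc/dcgm-exporter/default-counters.csv` gives a quick check of what the exporter is configured to collect.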
3 - API Reference
4 - CLI Reference
5 - Release Note
GPU Server
- Add Kubernetes image
- When creating a GPU Server, you can select the Kubernetes image (Ubuntu).
- An SSD volume with configurable IOPS and throughput has been added.
- When creating Block Storage, you can select the SSD_Provisioned disk type.
- You can set the maximum IOPS and Throughput values.
- ServiceWatch service integration is now provided
- You can monitor data through the ServiceWatch service.
- When creating a GPU server, you can select a RHEL image.
- Keypair management feature has been added.
- You can generate a keypair for use, or retrieve a public key and apply it.
- Add GPU Server feature
- The IP, Public NAT IP, and Private NAT IP configuration features have been added.
- An LLM Endpoint is provided for LLM usage.
- The method for sharing images between accounts has changed.
- You can create a new Image for sharing and share it.
- Add GPU Server usage guide
- Added the guides Using Multi-instance GPU on GPU Server and Using NVSwitch on GPU Server.
- GPU Server RHEL OS and GPU driver versions have been added.
- Add GPU Server feature
- NAT configuration feature has been added to the GPU Server.
- Samsung Cloud Platform Common Feature Changes
- Account, IAM, Service Home, tags, and other common CX changes have been applied.
- We have officially launched the GPU Server service.
- We have launched a virtualized computing service that lets you allocate and use infrastructure resources such as CPU, GPU, and memory as needed, without having to purchase them individually.