GPU Server

1: Overview

1.1: Server type
1.2: Monitoring Metrics
1.3: ServiceWatch Metrics

2: How-to guides

2.1: Manage Image
2.2: Manage Keypair
2.3: Use Multi-instance GPU on GPU Server
2.4: Use NVSwitch on GPU Server
2.5: Install ServiceWatch Agent

3: API Reference

4: CLI Reference

5: Release Note

1 - Overview

Service Overview

GPU Server is a virtualized computing service that lets you allocate infrastructure resources such as CPU, GPU, and memory in the amount you need, when you need them, without purchasing them individually. It is suited to workloads that demand fast computation in a cloud environment, such as AI model experiments, prediction, and inference, and lets you flexibly select performance-optimized resources according to the type and scale of the work. GPU Server provides the following features.

Provided features

GPU Server Management: Through a web-based console, users can create, modify, and delete GPU servers themselves (self-service), covering everything from provisioning to monitoring and billing.
Product offering by GPU quantity: Depending on the project’s purpose and scale, you can freely select the number of H100/A100 GPUs to configure a virtual server.
High‑Performance GPU Provision: We provide high‑performance GPU servers at physical‑server level using a pass‑through method.
Storage Connection: Provides additional attached storage besides the OS disk. Block Storage, File Storage, and Object Storage can be connected and used.
Strong Security Enforcement: Through the Security Group service, you can control inbound/outbound traffic exchanged with the external Internet or other VPCs (Virtual Private Cloud) to protect the server.
Monitoring: You can view monitoring information such as the status of computing resources (CPU, Memory, Disk, and GPU) through the Cloud Monitoring service.
Network Configuration Management: The server's subnet/IP can be changed from the values set at initial creation. The NAT IP can be enabled or disabled as needed.
Key Pair method: To ensure a secure OS access method, we provide a Key Pair method instead of ID/PW login.
Image Management: You can create and manage Custom Images and share them between projects.
ServiceWatch Service Integration: You can monitor data through the ServiceWatch service.

Components

GPU Server provides GPUs, NVSwitch, and NVLink on top of virtualized computing resources.

caution

NVSwitch can be enabled and used only for instance types that allocate eight GPUs on a single GPU server.

Specifications by GPU Type

A GPU (Graphics Processing Unit) performs the calculations required to render the images shown on a computer screen. Because it is specialized for parallel processing, it can handle large amounts of data quickly and is well suited to large-scale parallel workloads such as artificial intelligence (AI) and data analysis. The specifications of the GPU types provided by the GPU Server service are as follows.

Category	A100 Type	H100 Type	B300 Type
GPU Architecture	NVIDIA Ampere	NVIDIA Hopper	NVIDIA Blackwell Ultra
GPU Memory	80 GiB	80 GiB	268 GiB
GPU Transistors	54 billion 7N TSMC	80 billion 4N TSMC	208 billion 4NP TSMC
FP16 Tensor Core (Dense)	312 TFLOPs	989 TFLOPs	2.25 PFLOPs
FP8 Tensor Core (Dense)	Unsupported	1,979 TFLOPs	4.5 PFLOPs
FP4 Tensor Core (Dense)	Unsupported	Unsupported	13.5 PFLOPs
GPU Memory Bandwidth	2,039 GB/s HBM2e	3,352 GB/s HBM3	8 TB/s HBM3e
NVLink performance	NVLink 3	NVLink 4	NVLink 5
NVLink Signaling Rate	25 GB/s (x12)	25 GB/s (x18)	50 GB/s (x18)
NVSwitch GPU-to-GPU bandwidth	600 GB/s	900 GB/s	1.8 TB/s
Total NVSwitch aggregate bandwidth	4.8 TB/s	7.2 TB/s	14.4 TB/s

Table. Specifications by GPU Type
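The NVLink and NVSwitch rows of the table are related by simple arithmetic, which the following sketch reproduces. It assumes that the signaling rate is quoted per link and per direction (so the GPU-to-GPU figure is rate x links x 2) and that the aggregate figure covers the eight GPUs of an NVSwitch-enabled server; these interpretations are assumptions made for illustration, and the function names are hypothetical.

```python
# Values copied from the "Specifications by GPU Type" table.
# Assumption: the signaling rate is per link, per direction, so the
# bidirectional GPU-to-GPU bandwidth is rate * links * 2.
NVLINK = {
    "A100": {"rate_gb_s": 25, "links": 12},
    "H100": {"rate_gb_s": 25, "links": 18},
    "B300": {"rate_gb_s": 50, "links": 18},
}
GPUS_PER_SERVER = 8  # NVSwitch is available only on 8-GPU instance types


def gpu_to_gpu_gb_s(gpu: str) -> int:
    """Bidirectional GPU-to-GPU bandwidth in GB/s."""
    spec = NVLINK[gpu]
    return spec["rate_gb_s"] * spec["links"] * 2


def aggregate_tb_s(gpu: str) -> float:
    """Aggregate NVSwitch bandwidth across all eight GPUs in TB/s."""
    return gpu_to_gpu_gb_s(gpu) * GPUS_PER_SERVER / 1000


if __name__ == "__main__":
    for name in NVLINK:
        print(name, gpu_to_gpu_gb_s(name), "GB/s,", aggregate_tb_s(name), "TB/s")
```

Under these assumptions the computed values match the table: 600, 900, and 1,800 GB/s per GPU, and 4.8, 7.2, and 14.4 TB/s in aggregate.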

NPU Type Specifications

The NPU (Neural Processing Unit) is a processor specialized for AI inference operations; it runs generative AI and other AI inference workloads with high throughput and power efficiency. The specifications of the NPU Type provided by the GPU Server service are as follows.

Category	Furiosa RNGD
Architecture	Tensor Contraction Processor
BF16	256 TFLOPS
FP8	512 TFLOPS
Memory Bandwidth	HBM3 1.5 TB/s
Memory Capacity	HBM3 48 GB
Interconnect Interface	PCIe Gen5 x16

Table. Specifications by NPU Type

Server type

The server types offered by the GPU Server are as follows. For detailed information about the server types provided by GPU Server, see GPU Server server type.

Category	Server type	CPU vCore	Memory(GB)	GPU/NPU quantity
GPU-A100-1	g1v16a1	16	234	1
GPU-A100-1	g1v32a2	32	468	2
GPU-A100-1	g1v64a4	64	936	4
GPU-A100-1	g1v128a8	128	1,872	8
GPU-H100-2	g2v12h1	12	234	1
GPU-H100-2	g2v24h2	24	468	2
GPU-H100-2	g2v48h4	48	936	4
GPU-H100-2	g2v96h8	96	1,872	8
GPU-B300-3	g3v16b1	16	480	1
GPU-B300-3	g3v32b2	32	960	2
GPU-B300-3	g3v64b4	64	1,920	4
GPU-B300-3	g3v128b8	128	3,840	8
NPU-RNGD-1	n1v8r1	8	106	1
NPU-RNGD-1	n1v16r2	16	212	2
NPU-RNGD-1	n1v32r4	32	424	4
NPU-RNGD-1	n1v64r8	64	848	8

Table. GPU Server server type
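When choosing a server type, it can help to treat the table above as data and pick the smallest entry that satisfies a workload's needs. The following is a minimal sketch under that idea; the `ServerType` tuple and the `smallest_fit` helper are hypothetical names introduced for illustration and are not part of any Samsung Cloud Platform API.

```python
from typing import NamedTuple, Optional


class ServerType(NamedTuple):
    category: str
    name: str
    vcores: int
    memory_gb: int
    accelerators: int  # GPU or NPU quantity


# Rows copied from the "GPU Server server type" table.
SERVER_TYPES = [
    ServerType("GPU-A100-1", "g1v16a1", 16, 234, 1),
    ServerType("GPU-A100-1", "g1v32a2", 32, 468, 2),
    ServerType("GPU-A100-1", "g1v64a4", 64, 936, 4),
    ServerType("GPU-A100-1", "g1v128a8", 128, 1872, 8),
    ServerType("GPU-H100-2", "g2v12h1", 12, 234, 1),
    ServerType("GPU-H100-2", "g2v24h2", 24, 468, 2),
    ServerType("GPU-H100-2", "g2v48h4", 48, 936, 4),
    ServerType("GPU-H100-2", "g2v96h8", 96, 1872, 8),
    ServerType("GPU-B300-3", "g3v16b1", 16, 480, 1),
    ServerType("GPU-B300-3", "g3v32b2", 32, 960, 2),
    ServerType("GPU-B300-3", "g3v64b4", 64, 1920, 4),
    ServerType("GPU-B300-3", "g3v128b8", 128, 3840, 8),
    ServerType("NPU-RNGD-1", "n1v8r1", 8, 106, 1),
    ServerType("NPU-RNGD-1", "n1v16r2", 16, 212, 2),
    ServerType("NPU-RNGD-1", "n1v32r4", 32, 424, 4),
    ServerType("NPU-RNGD-1", "n1v64r8", 64, 848, 8),
]


def smallest_fit(category: str, min_accelerators: int = 1,
                 min_memory_gb: int = 0) -> Optional[ServerType]:
    """Return the smallest server type in a category meeting both minimums."""
    candidates = [
        s for s in SERVER_TYPES
        if s.category == category
        and s.accelerators >= min_accelerators
        and s.memory_gb >= min_memory_gb
    ]
    return min(candidates, key=lambda s: s.accelerators, default=None)
```

For example, `smallest_fit("GPU-H100-2", min_accelerators=3)` returns the `g2v48h4` entry, because GPUs are offered only in quantities of 1, 2, 4, or 8.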

OS and driver version

The operating systems (OS) supported by the GPU Server are as follows. Note that B300-type GPUs are supported only from a certain GPU driver version onward, so choose your image accordingly.

OS	OS version	Driver version	Server type classification
Ubuntu	24.04	ND 580.126.20	GPU-B300-3, GPU-H100-2, GPU-A100-1
Ubuntu	24.04	ND 570.195.03	GPU-H100-2, GPU-A100-1
Ubuntu	24.04	FRD 2026.2.0	NPU-RNGD-1
Ubuntu	22.04	ND 535.183.06	GPU-H100-2, GPU-A100-1
RHEL	9.6	ND 580.126.20	GPU-B300-3, GPU-H100-2, GPU-A100-1
RHEL	8.1	ND 580.126.20	GPU-B300-3, GPU-H100-2, GPU-A100-1
RHEL	8.1	ND 535.183.06	GPU-H100-2, GPU-A100-1

Table. GPU Server OS and driver version
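Because not every image supports every server type classification, a simple lookup over the table above shows which OS and driver combinations are valid for a given classification. This is an illustrative sketch only; the `IMAGES` list mirrors the table, and `images_for` is a hypothetical helper name.

```python
# Rows copied from the "GPU Server OS and driver version" table:
# (OS, OS version, driver version, supported server type classifications)
IMAGES = [
    ("Ubuntu", "24.04", "ND 580.126.20", {"GPU-B300-3", "GPU-H100-2", "GPU-A100-1"}),
    ("Ubuntu", "24.04", "ND 570.195.03", {"GPU-H100-2", "GPU-A100-1"}),
    ("Ubuntu", "24.04", "FRD 2026.2.0", {"NPU-RNGD-1"}),
    ("Ubuntu", "22.04", "ND 535.183.06", {"GPU-H100-2", "GPU-A100-1"}),
    ("RHEL", "9.6", "ND 580.126.20", {"GPU-B300-3", "GPU-H100-2", "GPU-A100-1"}),
    ("RHEL", "8.1", "ND 580.126.20", {"GPU-B300-3", "GPU-H100-2", "GPU-A100-1"}),
    ("RHEL", "8.1", "ND 535.183.06", {"GPU-H100-2", "GPU-A100-1"}),
]


def images_for(classification: str):
    """List (OS, OS version, driver version) rows that support a classification."""
    return [
        (os_name, os_version, driver)
        for os_name, os_version, driver, supported in IMAGES
        if classification in supported
    ]


if __name__ == "__main__":
    for row in images_for("GPU-B300-3"):
        print(row)
```

Running the lookup for `GPU-B300-3` returns only the rows with driver ND 580.126.20, which is the constraint the note above refers to.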

Constraints

The GPU Server has the following constraints.

Ubuntu 22.04 is provided as the current OS version.
GPU is provided via Pass-through.
Physical GPU cards are allocated in units of 1, 2, 4, or 8.

Preceding Service

The following services must be configured before you create this service. Prepare them in advance by referring to the user guide for each service.

Service Category	service	Detailed description
Networking	VPC	A service that provides an isolated virtual network in a cloud environment
Networking	Security Group	Virtual firewall that controls server traffic

Table. GPU Server Preceding Service

1.1 - Server type

GPU Server server type

GPU servers are categorized by the GPU type they offer, and the GPU used in a GPU server is determined by the server type selected when creating the GPU server. Please select the server type based on the specifications of the application you want to run on the GPU server.

The server types supported by the GPU Server are as follows.

GPU-H100-2 g2v12h1

Category	Example	Detailed description
Server type	GPU-H100-2	Provided server type classification. GPU-H100-2: GPU-H100 indicates the provided GPU type, 2 indicates the generation. GPU-A100-1: GPU-A100 indicates the provided GPU type, 1 indicates the generation.
Server specifications	g2	Provided server type classification and generation. g2: g indicates a GPU server specification, 2 indicates the generation.
Server specifications	v12	Number of vCores. v2: 2 virtual cores.
Server specifications	h1	GPU type and quantity. h1: h indicates GPU-H100, 1 indicates 1 GPU. a2: a indicates GPU-A100, 2 indicates 2 GPUs.

Table. GPU Server server type format
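The naming format described above can be decoded mechanically. The following sketch parses a server type name into its parts with a regular expression. The format table explains only the `g`, `v`, `h`, and `a` markers; the `n` (NPU) family, the `b` (B300) and `r` (RNGD) letters, and the optional `a` after the generation (AMD CPU variant, as in `g1av16a1`) are inferred from the server type tables in this document and should be treated as assumptions. All names in the code are hypothetical.

```python
import re
from typing import NamedTuple

# Accelerator letters; "b" and "r" are inferred from names such as
# g3v16b1 and n1v8r1 rather than documented in the format table.
ACCELERATORS = {"a": "GPU-A100", "h": "GPU-H100", "b": "GPU-B300", "r": "NPU-RNGD"}

_PATTERN = re.compile(
    r"^(?P<family>[gn])(?P<generation>\d+)(?P<variant>a?)"
    r"v(?P<vcores>\d+)(?P<accel>[ahbr])(?P<count>\d+)$"
)


class ParsedServerType(NamedTuple):
    family: str             # "g" for GPU server, "n" for NPU server
    generation: int
    amd_variant: bool       # True for names such as g1av16a1 (assumption)
    vcores: int
    accelerator: str
    accelerator_count: int


def parse_server_type(name: str) -> ParsedServerType:
    """Split a server type name such as g2v12h1 into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized server type name: {name!r}")
    return ParsedServerType(
        family=match["family"],
        generation=int(match["generation"]),
        amd_variant=match["variant"] == "a",
        vcores=int(match["vcores"]),
        accelerator=ACCELERATORS[match["accel"]],
        accelerator_count=int(match["count"]),
    )


if __name__ == "__main__":
    print(parse_server_type("g2v12h1"))
```

For `g2v12h1` this yields a generation-2 GPU server with 12 vCores and one GPU-H100.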

g1 server type

The g1 server type is a GPU Server that uses the NVIDIA A100 Tensor Core GPU, making it suitable for high‑performance applications.

Up to 8 NVIDIA A100 Tensor Core GPUs provided
Equipped with 6,912 CUDA cores and 432 Tensor cores per GPU.
Supports up to 128 vCPUs and 1,872 GB of memory
Maximum networking speed of 40 Gbps
600 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch
Additional AMD CPU-based server type provided (GPU-A100-1-A)

Category	Server type	GPU	CPU	Memory	GPU Memory	Network Bandwidth
GPU-A100-1	g1v16a1	1	16 vCore	234 GB	80 GiB	Maximum 20 Gbps
GPU-A100-1	g1v32a2	2	32 vCore	468 GB	160 GiB	Maximum 20 Gbps
GPU-A100-1	g1v64a4	4	64 vCore	936 GB	320 GiB	Up to 40 Gbps
GPU-A100-1	g1v128a8	8	128 vCore	1,872 GB	640 GiB	Maximum 40 Gbps
GPU-A100-1-A	g1av16a1	1	16 vCore	234 GB	80 GiB	Maximum 20 Gbps
GPU-A100-1-A	g1av32a2	2	32 vCore	468 GB	160 GiB	Maximum 20 Gbps
GPU-A100-1-A	g1av64a4	4	64 vCore	936 GB	320 GiB	Maximum 40 Gbps
GPU-A100-1-A	g1av128a8	8	128 vCore	1,872 GB	640 GiB	Maximum 40 Gbps

Table. GPU Server server type > GPU-A100-1 server type

g2 server type

The g2 server type is a GPU Server that uses the NVIDIA H100 Tensor Core GPU, making it suitable for high-performance applications.

Up to 8 NVIDIA H100 Tensor Core GPUs provided
Equipped with 16,896 CUDA cores and 528 Tensor cores per GPU
Supports up to 96 vCPUs and 1,872 GB of memory
Maximum networking speed of 40 Gbps
900 GB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch

Category	Server type	GPU	CPU	Memory	GPU Memory	Network Bandwidth
GPU-H100-2	g2v12h1	1	12 vCore	234 GB	80 GiB	Maximum 20 Gbps
GPU-H100-2	g2v24h2	2	24 vCore	468 GB	160 GiB	Maximum 20 Gbps
GPU-H100-2	g2v48h4	4	48 vCore	936 GB	320 GiB	Maximum 40 Gbps
GPU-H100-2	g2v96h8	8	96 vCore	1,872 GB	640 GiB	Maximum 40 Gbps

Table. GPU Server server type > GPU-H100-2 server type

g3 server type

The g3 server type is a GPU Server that uses the NVIDIA B300 Tensor Core GPU, making it suitable for high-performance applications.

Provides up to 8 NVIDIA B300 Tensor Core GPUs
Equipped with 20,480 CUDA cores and 640 Tensor cores per GPU.
Supports up to 128 vCPUs and 3,840 GB of memory.
Up to 40 Gbps networking speed
1.8 TB/s GPU-to-GPU P2P communication via NVIDIA NVSwitch

Category	Server type	GPU	CPU	Memory	GPU Memory	Network Bandwidth
GPU-B300-3	g3v16b1	1	16 vCore	480 GB	268 GiB	Maximum 20 Gbps
GPU-B300-3	g3v32b2	2	32 vCore	960 GB	536 GiB	Maximum 20 Gbps
GPU-B300-3	g3v64b4	4	64 vCore	1,920 GB	1,072 GiB	Up to 40 Gbps
GPU-B300-3	g3v128b8	8	128 vCore	3,840 GB	2,144 GiB	Maximum 40 Gbps

Table. GPU Server server type > GPU-B300-3 server type

n1 server type

The n1 server type is an NPU Server that uses the Furiosa RNGD NPU, making it suitable for AI inference workloads.

Provides up to 8 Furiosa RNGD NPUs
Supports 256 TFLOPS (BF16) and 512 TFLOPS (FP8) per NPU
Supports up to 64 vCPUs and 848 GB of memory
Maximum networking speed of 40 Gbps
PCIe Gen5 x16 interface support

Category	Server type	NPU	CPU	Memory	NPU Memory	Network Bandwidth
NPU-RNGD-1	n1v8r1	1	8 vCore	106 GB	48 GiB	Maximum 20 Gbps
NPU-RNGD-1	n1v16r2	2	16 vCore	212 GB	96 GiB	Maximum 20 Gbps
NPU-RNGD-1	n1v32r4	4	32 vCore	424 GB	192 GiB	Up to 40 Gbps
NPU-RNGD-1	n1v64r8	8	64 vCore	848 GB	384 GiB	Maximum 40 Gbps

Table. GPU Server server type > NPU-RNGD-1 server type

1.2 - Monitoring Metrics

Cloud Monitoring service termination notice

According to Samsung Cloud Platform’s policy, the Cloud Monitoring service is scheduled to be discontinued in September 2026.
Accordingly, after the September 2026 release, resource monitoring of the Samsung Cloud Platform via Cloud Monitoring will no longer be possible.

As the replacement service, ServiceWatch, released in October 2025, lets you continue monitoring your resources.
ServiceWatch replaces Cloud Monitoring and provides more modern and powerful features for a seamless monitoring environment.

If you are collecting metrics and logs through the Cloud Monitoring Agent, you need to switch to the ServiceWatch Agent.

For detailed information about ServiceWatch, please refer to ServiceWatch Overview.
For detailed information about the ServiceWatch Agent, please refer to the ServiceWatch Agent guide.

GPU Server Monitoring Metrics

The tables below show the monitoring metrics of the GPU server that can be viewed through Cloud Monitoring.

Basic monitoring metrics are provided even without installing the Agent; see Table. GPU Server Basic Monitoring Metrics (Provided by Default). The metrics that become available after installing the Agent are listed in Table. Additional monitoring metrics for GPU Server (Agent installation required).

For detailed usage of Cloud Monitoring, refer to the Cloud Monitoring guide.

Performance Item Name	Explanation	unit
Memory Total [Basic]	bytes of usable memory	bytes
Memory Used [Basic]	bytes of currently used memory	bytes
Memory Swap In [Basic]	bytes of memory swapped in	bytes
Memory Swap Out [Basic]	bytes of memory swapped out	bytes
Memory Free [Basic]	bytes of unused memory	bytes
Disk Read Bytes [Basic]	Read bytes	bytes
Disk Read Requests [Basic]	Number of read requests	cnt
Disk Write Bytes [Basic]	write bytes	bytes
Disk Write Requests [Basic]	Number of write requests	cnt
CPU Usage [Basic]	Average system CPU usage over 1 minute	%
Instance State [Basic]	Instance status	state
Network In Bytes [Basic]	Received bytes	bytes
Network In Dropped [Basic]	Incoming packet drop	cnt
Network In Packets [Basic]	Number of received packets	cnt
Network Out Bytes [Basic]	sent bytes	bytes
Network Out Dropped [Basic]	Transmit packet drop	cnt
Network Out Packets [Basic]	Number of transmitted packets	cnt

Table. GPU Server Basic Monitoring Metrics (Provided by Default)

Performance Item Name	Explanation	unit
GPU Count	Number of GPUs	cnt
GPU Memory Usage	GPU memory usage rate	%
GPU Memory Used	GPU memory used	MB
GPU Temperature	GPU temperature	℃
GPU Usage	GPU utilization	%
GPU Usage [Avg]	Overall average GPU utilization (%)	%
GPU Power Cap	Maximum power capacity of the GPU	W
GPU Power Usage	Current GPU power usage	W
GPU Memory Usage [Avg]	Average GPU memory utilization	%
GPU Count in use	Number of GPUs in use by jobs on the node	cnt
Execution Status for nvidia-smi	Result of running the nvidia-smi command	status
Core Usage [IO Wait]	Ratio of CPU time spent in wait state (disk wait)	%
Core Usage [System]	Proportion of CPU time spent in kernel space	%
Core Usage [User]	Proportion of CPU time spent in user space	%
CPU Cores	Number of CPU cores on the host	cnt
CPU Usage [Active]	Percentage of CPU time used, excluding Idle and IOWait states	%
CPU Usage [Idle]	It is the proportion of CPU time spent in idle state.	%
CPU Usage [IO Wait]	The proportion of CPU time spent in a waiting state (disk wait).	%
CPU Usage [System]	Percentage of CPU time used by the kernel	%
CPU Usage [User]	Percentage of CPU time used in user space.	%
CPU Usage/Core [Active]	Percentage of CPU time used other than Idle and IOWait states	%
CPU Usage/Core [Idle]	It is the proportion of CPU time spent in idle state.	%
CPU Usage/Core [IO Wait]	This is the proportion of CPU time spent in a waiting state (disk wait).	%
CPU Usage/Core [System]	Percentage of CPU time used by the kernel	%
CPU Usage/Core [User]	Percentage of CPU time used in user space.	%
Disk CPU Usage [IO Request]	Proportion of CPU time during which I/O requests to the device were executed	%
Disk Queue Size [Avg]	The average queue length of requests executed for the device.	num
Disk Read Bytes	The number of bytes read per second from the device.	bytes
Disk Read Bytes [Delta Avg]	Average of system.diskio.read.bytes_delta for individual disks	bytes
Disk Read Bytes [Delta Max]	Maximum system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta Min]	Minimum system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta Sum]	Sum of the system.diskio.read.bytes_delta of individual disks	bytes
Disk Read Bytes [Delta]	Delta of the system.diskio.read.bytes value for each disk	bytes
Disk Read Bytes [Success]	Total number of bytes successfully read.	bytes
Disk Read Requests	Number of read requests to the disk device per second	cnt
Disk Read Requests [Delta Avg]	Average of the system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Max]	Maximum system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Min]	Minimum of system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Delta Sum]	Sum of system.diskio.read.count_delta for individual disks	cnt
Disk Read Requests [Success Delta]	Delta of system.diskio.read.count for each disk	cnt
Disk Read Requests [Success]	Total number of successful reads	cnt
Disk Request Size [Avg]	The average size of requests executed on the device (unit: sectors).	num
Disk Service Time [Avg]	Average service time (milliseconds) of input requests executed on the device.	ms
Disk Wait Time [Avg]	Average time taken for requests executed on the supported device.	ms
Disk Wait Time [Read]	Average disk read wait time	ms
Disk Wait Time [Write]	Average disk write wait time	ms
Disk Write Bytes [Delta Avg]	Average of system.diskio.write.bytes_delta for each disk	bytes
Disk Write Bytes [Delta Max]	Maximum system.diskio.write.bytes_delta of individual disks	bytes
Disk Write Bytes [Delta Min]	Minimum of system.diskio.write.bytes_delta for individual disks	bytes
Disk Write Bytes [Delta Sum]	Sum of system.diskio.write.bytes_delta for individual disks	bytes
Disk Write Bytes [Delta]	Delta of the system.diskio.write.bytes value for each disk	bytes
Disk Write Bytes [Success]	Total number of bytes successfully written.	bytes
Disk Write Requests	Number of write requests to the disk device per second	cnt
Disk Write Requests [Delta Avg]	Average of system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Max]	Maximum system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Min]	Minimum of system.diskio.write.count_delta for individual disks	cnt
Disk Write Requests [Delta Sum]	Sum of the system.diskio.write.count_delta of individual disks	cnt
Disk Write Requests [Success Delta]	Delta of system.diskio.write.count for each disk	cnt
Disk Write Requests [Success]	Total number of successful writes	cnt
Disk Writes Bytes	It is the number of bytes per second written to the device.	bytes
Filesystem Hang Check	filesystem (local/NFS) hang check (normal:1, abnormal:0)	status
Filesystem Nodes	It is the total number of file nodes in the file system.	cnt
Filesystem Nodes [Free]	It is the total number of available file nodes in the file system.	cnt
Filesystem Size [Available]	Disk space (bytes) available to unprivileged users	bytes
Filesystem Size [Free]	Available disk space (bytes)	bytes
Filesystem Size [Total]	Total disk space (bytes)	bytes
Filesystem Usage	Used disk space percentage	%
Filesystem Usage [Avg]	Average of individual filesystem.used.pct values	%
Filesystem Usage [Inode]	inode usage	%
Filesystem Usage [Max]	Maximum among individual filesystem.used.pct	%
Filesystem Usage [Min]	minimum among individual filesystem.used.pct	%
Filesystem Usage [Total]	-	%
Filesystem Used	Used disk space (bytes)	bytes
Filesystem Used [Inode]	inode usage	bytes
Memory Free	Total amount of available memory (bytes).	bytes
Memory Free [Actual]	Actual usable memory (bytes).	bytes
Memory Free [Swap]	Available swap memory.	bytes
Memory Total	total memory	bytes
Memory Total [Swap]	Total swap memory.	bytes
Memory Usage	Percentage of used memory	%
Memory Usage [Actual]	Percentage of memory actually used	%
Memory Usage [Cache Swap]	cached swap usage rate	%
Memory Usage [Swap]	Percentage of used swap memory	%
Memory Used	used memory	bytes
Memory Used [Actual]	Actual memory used (bytes).	bytes
Memory Used [Swap]	Swap memory used.	bytes
Collisions	Network collision	cnt
Network In Bytes	Number of received bytes	bytes
Network In Bytes [Delta Avg]	Average of system.network.in.bytes_delta for individual networks	bytes
Network In Bytes [Delta Max]	Maximum of system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Min]	Minimum system.network.in.bytes_delta for each network	bytes
Network In Bytes [Delta Sum]	Sum of system.network.in.bytes_delta for individual networks	bytes
Network In Bytes [Delta]	Delta of received byte count	bytes
Network In Dropped	Number of dropped packets among incoming packets	cnt
Network In Errors	Number of errors during reception	cnt
Network In Packets	Number of received packets	cnt
Network In Packets [Delta Avg]	Average of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Max]	Maximum of system.network.in.packets_delta for each network	cnt
Network In Packets [Delta Min]	Minimum of system.network.in.packets_delta for individual networks	cnt
Network In Packets [Delta Sum]	Sum of system.network.in.packets_delta for individual networks	cnt
Network In Packets [Delta]	Delta of received packet count	cnt
Network Out Bytes	Number of transmitted bytes	bytes
Network Out Bytes [Delta Avg]	Average of system.network.out.bytes_delta for each network	bytes
Network Out Bytes [Delta Max]	Maximum of system.network.out.bytes_delta for individual networks	bytes
Network Out Bytes [Delta Min]	Minimum of system.network.out.bytes_delta for individual networks	bytes
Network Out Bytes [Delta Sum]	Sum of system.network.out.bytes_delta for individual networks	bytes
Network Out Bytes [Delta]	Delta of transmitted byte count	bytes
Network Out Dropped	Number of dropped packets among outgoing packets	cnt
Network Out Errors	Number of errors during transmission	cnt
Network Out Packets	Number of transmitted packets	cnt
Network Out Packets [Delta Avg]	Average of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Max]	Maximum of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Min]	Minimum of system.network.out.packets_delta for each network	cnt
Network Out Packets [Delta Sum]	Sum of system.network.out.packets_delta for individual networks	cnt
Network Out Packets [Delta]	Delta of transmitted packet count	cnt
Open Connections [TCP]	All open TCP connections	cnt
Open Connections [UDP]	All open UDP connections	cnt
Port Usage	Available port usage rate	%
SYN Sent Sockets	Number of sockets in SYN_SENT state (when connecting from local to remote)	cnt
Kernel PID Max	kernel.pid_max value	cnt
Kernel Thread Max	kernel.threads-max value	cnt
Process CPU Usage	Percentage of CPU time consumed by the process since the last update.	%
Process CPU Usage/Core	Percentage of CPU time used by the process since the last event.	%
Process Memory Usage	Proportion of main memory (RAM) occupied by a process	%
Process Memory Used	Resident Set size. The amount of memory a process occupies in RAM.	bytes
Process PID	process pid	PID
Process PPID	parent process PID	PID
Processes [Dead]	Number of dead processes	cnt
Processes [Idle]	Number of idle processes	cnt
Processes [Running]	Number of running processes	cnt
Processes [Sleeping]	Number of sleeping processes	cnt
Processes [Stopped]	stopped processes count	cnt
Processes [Total]	Total number of processes	cnt
Processes [Unknown]	Number of processes whose status could not be retrieved or is unknown	cnt
Processes [Zombie]	Zombie processes count	cnt
Running Process Usage	process usage rate	%
Running Processes	Number of running processes	cnt
Running Thread Usage	Thread usage rate	%
Running Threads	Total number of threads running in running processes	cnt
Context Switches	context switch count (per second)	cnt
Load/Core [1 min]	The load over the last 1 minute divided by the number of cores	cnt
Load/Core [15 min]	The load over the last 15 minutes divided by the number of cores	cnt
Load/Core [5 min]	The load over the last 5 minutes divided by the number of cores	cnt
Multipaths [Active]	External storage connection path status = active count	cnt
Multipaths [Failed]	External storage connection path status = failed count	cnt
Multipaths [Faulty]	External storage connection path status = faulty count	cnt
NTP Offset	measured offset of the last sample (the time difference between the NTP server and the local environment)	num
Run Queue Length	Execution queue length	num
Uptime	OS uptime (milliseconds).	ms
Context Switchies	CPU context switch count (per second)	cnt
Disk Read Bytes [Sec]	Number of bytes read from a Windows logical disk in 1 second	cnt
Disk Read Time [Avg]	Average data read time (seconds)	sec
Disk Transfer Time [Avg]	Disk average wait time	sec
Disk Usage	Disk usage	%
Disk Write Bytes [Sec]	Number of bytes written in one second on a Windows logical disk	cnt
Disk Write Time [Avg]	Average data write time (seconds)	sec
Pagingfile Usage	Paging file usage	%
Pool Used [Non Paged]	Nonpaged Pool usage in kernel memory	bytes
Pool Used [Paged]	Paged Pool usage in kernel memory	bytes
Process [Running]	Number of currently running processes	cnt
Threads [Running]	Number of currently running threads	cnt
Threads [Waiting]	Number of threads waiting for processor time	cnt

Table. Additional monitoring metrics for GPU Server (Agent installation required)
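Several metrics in the table are derived from other values: CPU Usage [Active] excludes the Idle and IOWait states, Load/Core divides the load by the number of cores, and the [Delta Avg], [Delta Max], [Delta Min], and [Delta Sum] metrics aggregate per-device deltas. The sketch below illustrates these relationships with plain arithmetic. It is an interpretation of the descriptions above, not the Agent's actual implementation; the function names and sample values are hypothetical.

```python
from statistics import mean
from typing import Dict, Mapping


def cpu_active_pct(idle_pct: float, iowait_pct: float) -> float:
    """CPU time used, excluding the Idle and IOWait states (percent)."""
    return 100.0 - idle_pct - iowait_pct


def load_per_core(load: float, cores: int) -> float:
    """Load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load / cores


def delta_stats(deltas: Mapping[str, float]) -> Dict[str, float]:
    """Aggregate per-device deltas (for example, read bytes per disk)."""
    values = list(deltas.values())
    return {
        "avg": mean(values),
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
    }


if __name__ == "__main__":
    print(cpu_active_pct(idle_pct=70.0, iowait_pct=5.0))
    print(load_per_core(load=4.0, cores=16))
    # Hypothetical device names and values.
    print(delta_stats({"vda": 100, "vdb": 300}))
```

A Load/Core value near or above 1 suggests that, on average, every core had runnable work during the interval.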

1.3 - ServiceWatch Metrics

The GPU Server sends metrics to ServiceWatch. The metrics provided by default monitoring are data collected at 5‑minute intervals. If detailed monitoring is enabled, you can view data collected at 1‑minute intervals.

information

The basic and detailed monitoring of the GPU Server are provided with the same metrics as the Virtual Server, and the namespace is also provided as Virtual Server.
GPU-related metrics are provided through ServiceWatch Agent, and for instructions on collecting metrics using ServiceWatch Agent, refer to the ServiceWatch Agent guide.

Reference

For how to view metrics in ServiceWatch, refer to the ServiceWatch guide.

Refer to How-to guides > ServiceWatch Detailed Monitoring Activation for how to enable detailed monitoring of the GPU Server.

Basic Metrics

The following are the basic metrics for the Virtual Server namespace.

The metrics whose names are displayed in bold below are the key metrics selected from the basic metrics provided by Virtual Server. Key metrics are used to build the service dashboards that ServiceWatch creates automatically for each service.

For each metric, the user guide indicates which statistics are meaningful to view; among these, the statistics shown in bold are the primary statistics. In the service dashboard, key metrics are displayed using their primary statistics.

Performance items	Detailed description	Unit	Meaningful statistics
Instance State	Instance status (1: Active, 0: Off)	None	Total
CPU Usage	CPU usage	Percent	Average Maximum Minimum
Disk Read Bytes	Bytes read from the block device	Bytes	Total Average Maximum Minimum
Disk Read Requests	Number of read requests on the block device	Count	Total Average Maximum Minimum
Disk Write Bytes	Bytes written to the block device	Bytes	Total Average Maximum Minimum
Disk Write Requests	Number of write requests on the block device	Count	Total Average Maximum Minimum
Network In Bytes	Bytes received on the network interface	Bytes	Total Average Maximum Minimum
Network In Dropped	Number of received packets dropped on the network interface	Count	Total Average Maximum Minimum
Network In Packets	Number of packets received on the network interface	Count	Total Average Maximum Minimum
Network Out Bytes	Bytes transmitted on the network interface	Bytes	Total Average Maximum Minimum
Network Out Dropped	Number of transmitted packets dropped on the network interface	Count	Total Average Maximum Minimum
Network Out Packets	Number of packets transmitted on the network interface	Count	Total Average Maximum Minimum

Table. Virtual Server Basic Metrics
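To make the statistics column concrete, the following sketch computes Total, Average, Maximum, and Minimum over a set of data points and rolls 1-minute samples up into 5-minute periods. How ServiceWatch itself aggregates data is not described here, so this is only an illustration of what the four statistics mean; `summarize` and `rollup` are hypothetical names and the sample values are made up.

```python
from typing import Dict, List, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Compute the four statistics listed in the metrics table."""
    if not values:
        raise ValueError("at least one data point is required")
    return {
        "Total": sum(values),
        "Average": sum(values) / len(values),
        "Maximum": max(values),
        "Minimum": min(values),
    }


def rollup(values: Sequence[float], period: int = 5) -> List[Dict[str, float]]:
    """Group consecutive samples into periods and summarize each one."""
    return [
        summarize(values[i:i + period])
        for i in range(0, len(values), period)
    ]


if __name__ == "__main__":
    # Ten hypothetical 1-minute CPU Usage samples (percent).
    samples = [10, 20, 30, 40, 50, 5, 5, 5, 5, 5]
    for stats in rollup(samples):
        print(stats)
```

For a percentage metric such as CPU Usage, Average, Maximum, and Minimum are the informative statistics, while Total is more useful for counters such as bytes or request counts, which matches the table above.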

2 - How-to guides

Users can create the service by entering the required GPU Server information and selecting detailed options through the Samsung Cloud Platform Console.

Create GPU Server

You can create and use a GPU Server service from the Samsung Cloud Platform Console.

To create a GPU Server, follow these steps.

1. Click the All Services > Compute > GPU Server menu. You are taken to the Service Home page of GPU Server.
2. On the Service Home page, click the Create GPU Server button. You are taken to the GPU Server Creation page.

3. On the GPU Server Creation page, enter the information required to create the service and select detailed options.

Select the required information in the Image and Version Selection area.

Category	required status	Detailed description
Image	Required	Select the provided Image type. Standard: Samsung Cloud Platform standard Image (RHEL, Ubuntu). Custom: User-created Image. Kubernetes: Kubernetes Image (Ubuntu).
Image version	Required	Select the version of the chosen Image. A list of versions of the offered server Image is provided. For detailed information about the provided server images, refer to OS and driver version.

Table. GPU Server image and version selection input fields

Enter or select the required information in the Service Information Input area.

Category	required status	Detailed description
Number of servers	Required	Number of GPU Server instances to create concurrently. Only numeric input is allowed; enter a value between 1 and 100.
Service Type > Server Type	Required	GPU Server server type. Indicates the specifications of the GPU-type server; select a server that includes 1, 2, 4, or 8 GPUs. For detailed information about the server types offered by GPU Server, refer to GPU Server server type.
Service Type > Planned Compute	Optional	Status of resources with Planned Compute configured. In Use: Number of resources with Planned Compute that are currently in use. Configured: Number of resources with Planned Compute configured. Coverage preview: Amount applied per resource by Planned Compute. Apply for Planned Compute service: Navigate to the Planned Compute service application page. For more details, see Apply for Planned Compute.
Block Storage	Required	Set the Block Storage used by the GPU Server according to its purpose. Default: Area where the OS is installed and used. Capacity is entered in Units (minimum capacity varies by OS image type). RHEL: values between 3 and 1,536 are allowed. Ubuntu: values between 3 and 1,536 are allowed. SSD: high-performance general volume. HDD: general volume. SSD/HDD_KMS: additional encrypted volume that uses Samsung Cloud Platform KMS (Key Management System) encryption keys; encryption can be applied only at initial creation (cannot be changed after creation), and performance degradation may occur when using the SSD_KMS disk type. SSD_Provisioned: SSD volume with configurable IOPS and Throughput. Additional: Used when extra space beyond the OS area is needed. After selecting Use, enter the storage type and capacity. To add storage, click the + button (up to 25 can be added); to delete, click the x button. Capacity can be entered as a value between 1 and 1,536 Units; since 1 Unit equals 8 GB, volumes of 8 to 12,288 GB can be created. SSD: high-performance general volume. HDD: general volume. SSD/HDD_KMS: additional encrypted volume that uses Samsung Cloud Platform KMS (Key Management System) encryption keys; encryption can be applied only at initial creation (cannot be changed after creation), and performance degradation may occur when using the SSD_KMS disk type. HDD/SSD_MultiAttach: volume that can be attached to two or more servers. SSD_Provisioned: SSD volume with configurable IOPS and Throughput. For details on each Block Storage type, see Create Block Storage. Delete on termination: If Delete on termination is set to Use, the volume is deleted together with the server. Volumes that have snapshots are not deleted even when Delete on termination is set to Use. A multi-attach volume is deleted only when the server being removed is the last remaining server attached to the volume.
Max IOPS	Required	Enter a maximum IOPS value between 5,000 and 20,000. Can be set only when the disk type is SSD_Provisioned.
Max Throughput	Required	Enter a maximum Throughput value between 250 and 1,000. Can be set only when the disk type is SSD_Provisioned.

Table. GPU Server Service Configuration Items
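The Block Storage limits in the table reduce to a few range checks and one conversion (1 Unit = 8 GB). The sketch below encodes them so that values can be checked before they are entered in the Console. It is illustrative only: the function names are hypothetical, the minimum of 3 Units for the OS volume is taken from the RHEL and Ubuntu ranges above, and no Samsung Cloud Platform API is called.

```python
UNIT_GB = 8          # 1 Unit equals 8 GB
MAX_UNITS = 1536     # upper bound for both OS and additional volumes
OS_MIN_UNITS = 3     # RHEL and Ubuntu OS volumes: 3 to 1,536 Units
EXTRA_MIN_UNITS = 1  # additional volumes: 1 to 1,536 Units


def units_to_gb(units: int, minimum_units: int = EXTRA_MIN_UNITS) -> int:
    """Convert a capacity in Units to GB after checking the allowed range."""
    if not minimum_units <= units <= MAX_UNITS:
        raise ValueError(
            f"capacity must be between {minimum_units} and {MAX_UNITS} Units"
        )
    return units * UNIT_GB


def check_provisioned(max_iops: int, max_throughput: int) -> None:
    """Check the SSD_Provisioned limits listed in the table."""
    if not 5_000 <= max_iops <= 20_000:
        raise ValueError("Max IOPS must be between 5,000 and 20,000")
    if not 250 <= max_throughput <= 1_000:
        raise ValueError("Max Throughput must be between 250 and 1,000")


if __name__ == "__main__":
    print(units_to_gb(1), units_to_gb(MAX_UNITS))     # smallest and largest volume
    print(units_to_gb(3, minimum_units=OS_MIN_UNITS)) # smallest OS volume
    check_provisioned(max_iops=10_000, max_throughput=500)
```

The conversion reproduces the 8 to 12,288 GB range stated in the table for additional volumes.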

In the Required Information Input area, enter or select the required information.

Category	Required	Detailed description
Server name	Required	If the number of selected servers is 1, enter a name to distinguish the server. The hostname is set to the entered server name. Enter up to 63 characters using English letters, numbers, spaces, and special characters (`-`, `_`).
Server name Prefix	Required	If the number of selected servers is 2 or more, enter a Prefix to distinguish each server that is created. Server names are generated automatically in the format of the user input value (prefix) + ‘`-#`’. Enter up to 59 characters using English letters, numbers, spaces, and special characters (`-`, `_`).
Network Settings > Create New Network Port	Required	Set the network where the GPU Server will be installed. Select a pre-created VPC. General Subnet: select a pre-created general Subnet. The IP can be set to auto-generate or user input; if Input is selected, the user enters the IP directly. NAT: available only when there is a single server and the VPC is connected to an Internet Gateway. Checking Use allows selection of a NAT IP. NAT IP: select a NAT IP. If there is no NAT IP to select, click the Create New button to create a Public IP, then click the Refresh button to view and select the created Public IP. Creating a Public IP incurs charges according to the Public IP pricing policy. Local Subnet (optional): select Use for the local Subnet. It is not required for creating the service. A pre-created local Subnet must be selected. The IP can be set to auto-generate or user input; if Input is selected, the user enters the IP directly.
Network Settings > Assign Existing Network Port	Required	Set the network where the GPU Server will be installed. Select a pre-created VPC. General Subnet: select a pre-created general Subnet and Port. NAT: available only when there is a single server and the VPC is connected to an Internet Gateway. Checking Use allows selection of a NAT IP. NAT IP: select a NAT IP. If there is no NAT IP to select, click the Create New button to create a Public IP, then click the Refresh button to view and select the created Public IP. Local Subnet (optional): select Use for the local Subnet. Select a pre-created local Subnet and Port.
Security Group	Optional	Settings required to connect to the server. Select: you can select up to 5 pre-created Security Groups. Create New: if there is no Security Group to apply, create one separately in the Security Group service. If you do not configure a Security Group, all connections are blocked, so you must configure it to allow the required access.
Keypair	Required	User authentication method used when connecting to the server. Create New: create a new Keypair when one is required; see Create a Keypair. Default login accounts by OS: RHEL: cloud-user, Ubuntu: ubuntu.

Table. GPU Server required information input items
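
The naming rules in the table can be checked before entering values in the console. The sketch below is illustrative only: the function names are hypothetical, and the assumption that `#` in the `-#` suffix is a 1-based index is ours, since the source only states the `prefix-#` format.

```python
import re

# English letters, numbers, spaces, '-' and '_' (character set from the table above).
_ALLOWED = re.compile(r"[A-Za-z0-9 _-]+")


def is_valid_server_name(name: str) -> bool:
    """Single-server name: allowed characters, at most 63 characters."""
    return len(name) <= 63 and _ALLOWED.fullmatch(name) is not None


def is_valid_prefix(prefix: str) -> bool:
    """Multi-server prefix: allowed characters, at most 59 characters."""
    return len(prefix) <= 59 and _ALLOWED.fullmatch(prefix) is not None


def names_from_prefix(prefix: str, count: int) -> list[str]:
    """Build names as prefix + '-#'.

    ASSUMPTION: '#' is a 1-based index; the console may number servers differently.
    """
    if not is_valid_prefix(prefix):
        raise ValueError("invalid prefix")
    return [f"{prefix}-{i}" for i in range(1, count + 1)]


print(is_valid_server_name("gpu-train_01"))
print(names_from_prefix("gpu-train", 3))
```

The 59-character prefix limit leaves room for the generated suffix within the 63-character name limit.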

In the Additional Information Input area, enter or select the required information.

Category	Required	Detailed description
Lock	Optional	Lock usage setting. When Lock is used, actions such as server termination, start, and stop are blocked, which prevents malfunctions caused by mistakes.
Init script	Optional	Script executed when the server starts. The init script must be written as a Batch script for Windows, a Shell script for Linux, or cloud-init, depending on the image type. Up to 45,000 bytes can be entered.
Tag	Optional	Add tags. Up to 50 can be added per resource. After clicking the Add Tag button, enter or select the Key and Value.

Table. GPU Server Additional Information Input Items
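
Because the Init script field accepts at most 45,000 bytes, it can help to measure a script before pasting it in. The following is a minimal sketch: the script content and the function name are hypothetical, and it assumes the limit is counted in UTF-8 encoded bytes, which the source does not state.

```python
MAX_INIT_SCRIPT_BYTES = 45_000  # limit from the table above

# Hypothetical example of a Linux shell init script.
INIT_SCRIPT = """#!/bin/bash
# Record the first boot time (illustrative content only).
date > /var/log/first-boot.txt
"""


def init_script_size_ok(script: str) -> bool:
    """Return True if the script fits in the limit.

    ASSUMPTION: the console counts UTF-8 encoded bytes.
    """
    return len(script.encode("utf-8")) <= MAX_INIT_SCRIPT_BYTES


print(len(INIT_SCRIPT.encode("utf-8")), init_script_size_ok(INIT_SCRIPT))
```

Counting encoded bytes rather than characters matters when a script contains non-ASCII comments, since those characters occupy more than one byte each.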

In the Summary panel, check the detailed information and estimated charges, and click the Create button.
When the popup notifying creation opens, click the Confirm button.
- Once creation is complete, check the created resources on the GPU Server List page.

Check GPU Server detailed information

The GPU Server service allows you to view and edit the full resource list and detailed information. The GPU Server Details page consists of Details, Tags, Activity Log tabs. To view detailed information about the GPU Server service, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.

On the GPU Server List page, click the resource to view detailed information. You will be taken to the GPU Server Details page.

The GPU Server Details page displays status information and additional feature information, and consists of Details, Tags, Activity Log tabs.

For detailed information about GPU Server additional features, see GPU Server management additional features.

Category	Detailed description
GPU Server status	Status of the GPU Server created by the user. Build: the Build command has been received. Building: build in progress. Networking: server creation in progress. Scheduling: server creation in progress. Block_Device_Mapping: Block Storage is being connected during server creation. Spawning: server creation in progress. Active: available. Powering_off: a stop request has been made. Deleting: server deletion in progress. Reboot_Started: reboot in progress. Error: error state. Migrating: the server is migrating to another host. Reboot: the Reboot command has been received. Rebooting: reboot in progress. Rebuild: the Rebuild command has been received. Rebuilding: a Rebuild request has been made. Rebuild_Spawning: Rebuild in progress. Shutoff: powering off is complete.
Server control	Buttons to change the server status. Start: start a stopped server. Stop: stop a running server. Restart: restart a running server.
Image generation	Create a user custom image from the current server's image.
Console log	View the console log output by the current server. For more details, see Check console log.
Create dump	Create a dump of the current server. The dump file is created inside the GPU Server. For detailed instructions, see Create Dump.
Rebuild	All data and settings of the existing server are deleted, and a new server is set up. For details, see Execute Rebuild.
Service cancellation	Button to cancel (terminate) the service.

Table. GPU Server status information and additional features
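
When reading the status column, it can be useful to separate statuses in which the server is still changing from those in which it has settled. The sketch below uses the status names from the table above. Treating Active, Shutoff, and Error as settled is our reading of the descriptions, not a statement from the product, and the function name is hypothetical.

```python
# Status names copied from the status table above.
ALL_STATUSES = {
    "Build", "Building", "Networking", "Scheduling", "Block_Device_Mapping",
    "Spawning", "Active", "Powering_off", "Deleting", "Reboot_Started",
    "Error", "Migrating", "Reboot", "Rebooting", "Rebuild", "Rebuilding",
    "Rebuild_Spawning", "Shutoff",
}

# ASSUMPTION: these are the statuses in which no operation is in progress.
SETTLED = {"Active", "Shutoff", "Error"}


def is_transitional(status: str) -> bool:
    """Return True while an operation is still in progress on the server."""
    if status not in ALL_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return status not in SETTLED


for s in ("Spawning", "Active", "Rebooting", "Shutoff"):
    print(s, is_transitional(s))
```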

Notice

When using the MIG feature, you must recheck the MIG settings after the GPU Server's Rebooting state has finished.

Detailed Information

On the GPU Server List page, you can view detailed information of the selected resource and, if needed, modify it.

Category	Detailed description
Service	Service name
Resource Type	Resource Type
SRN	Unique resource ID in Samsung Cloud Platform. In the GPU Server service, it refers to the GPU Server SRN.
Resource name	Resource name. In the GPU Server service, it refers to the GPU Server name.
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation date and time	Service creation date and time
Modifier	User who edited the service information
Modification date and time	Date and time the service information was modified
Server name	Server name
Server type	Displays vCPU, memory, and GPU information. To change to a different server type, click the Edit button.
Image name	OS image and version of the service
Lock	Displays the Lock usage status. To change the Lock property value, click the Edit button.
Keypair name	User-configured server authentication information
Planned Compute	Status of resources with Planned Compute configured. For details, see Apply for Planned Compute.
LLM Endpoint	URL for using LLM
ServiceWatch detailed monitoring	When enabled, monitoring data is available in the ServiceWatch service. You can change the activation status by clicking the Edit button. For more information about the ServiceWatch service, see the ServiceWatch Overview.
Network	Network information of the GPU Server: VPC name, general Subnet name, IP, NAT IP, NAT IP status, Security Group name. To change the NAT IP value, click the Edit button. To change the Security Group, click the Edit button. Add as new network: select a general Subnet and IP. You can select another general Subnet within the same VPC. The IP can be set to auto-generate or user input; if Input is selected, the user enters the IP directly. Add using existing port: select a pre-created general Subnet and port.
Local Subnet	Local Subnet information of the GPU Server: local Subnet name, local Subnet IP, Security Group name. To change the Security Group, click the Edit button. Add as new network: select a local Subnet and IP. You can select a different local Subnet within the same VPC. The IP can be set to auto-generate or user input; if Input is selected, the user enters the IP directly. Add using existing port: select a pre-created local Subnet and port.
Block Storage	Information about the Block Storage attached to the server: Volume ID, Volume Name, Type, Capacity, Connection Info, Category, Delete on termination, Status. Add: attach additional Block Storage if needed. Edit Delete on termination: modify the Delete on termination value. More > Detach: detach the selected Block Storage from the list.

Table. GPU Server Detailed Information Tab Items

Caution

When using ServiceWatch Detailed Monitoring, additional charges will apply.

Category	Detailed description
Tag list	Tag list. You can view the Key and Value information of each tag. Up to 50 tags can be added per resource. When entering tags, you can search and select from the list of previously created Keys and Values.

Activity Log

On the GPU Server List page, you can view the operation history of the selected resource.

Category	Detailed description
Activity log list	Resource change history. You can check the operation timestamp, resource ID, resource name, operation details, event topic, operation result, and operator information.

Table. GPU Server Activity Log tab items

Control GPU Server Operation

If you need to control the operation of the created GPU Server resources, you can do so from the GPU Server List or GPU Server Details page. You can start a stopped server, or stop or restart a running server.

Start GPU Server

You can start a GPU Server that is shut off. To start the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource to start among the servers in the Shutoff state. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can start each resource using the More button on the right.
- After selecting multiple servers with the checkboxes, you can control them simultaneously using the Start button at the top.
On the GPU Server Details page, click the Start button at the top to start the server. Check the changed server status in the Status Indicator item.
- When the GPU Server startup is complete, the server status changes from Shutoff to Active.
- For detailed information about the GPU Server status, see Check GPU Server detailed information.

Stop GPU Server

You can stop a GPU Server that is running (Active). To stop the GPU Server, follow the steps below.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource to stop among the servers in the Active state. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can stop each resource using the More button on the right.
- After selecting multiple servers with the checkboxes, you can control them simultaneously using the Stop button at the top.
On the GPU Server Details page, click the Stop button at the top to stop the server. Check the changed server status in the Status Indicator item.
- When the GPU Server shutdown is complete, the server status changes from Active to Shutoff.
- For detailed information about the GPU Server status, see Check GPU Server detailed information.

Restart GPU Server

You can restart the created GPU Server. To restart the GPU server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource to restart. You will be taken to the GPU Server Details page.
- On the GPU Server List page, you can restart each resource using the More button on the right.
- After selecting multiple servers with the checkboxes, you can control them simultaneously using the Restart button at the top.
On the GPU Server Details page, click the Restart button at the top to restart the server. Check the changed server status in the Status Indicator item.
- During a GPU Server restart, the server status goes through Rebooting and finally changes to Active.
- For detailed information about the GPU Server status, see Check GPU Server detailed information.
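
The status changes described for start, stop, and restart can be expressed as a small polling helper. This is a sketch under stated assumptions: `get_status` is a hypothetical callable that returns the current status string by whatever means you have available, and the helper name, timeout, and polling interval are illustrative. No real Samsung Cloud Platform API is used here.

```python
import time
from typing import Callable

# Final status expected after each control action (taken from the steps above).
EXPECTED_FINAL = {"start": "Active", "stop": "Shutoff", "restart": "Active"}


def wait_for_final_status(
    action: str,
    get_status: Callable[[], str],  # hypothetical: returns the current status string
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the server reaches the status expected for action."""
    target = EXPECTED_FINAL[action]
    # A restart begins and ends in Active, so require that the server has
    # left Active (for example, Rebooting) before accepting Active as final.
    left_target = action != "restart"
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "Error":
            raise RuntimeError(f"server entered Error state during {action}")
        if status != target:
            left_target = True
        elif left_target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{action} did not reach {target} within {timeout_s}s")
        sleep(interval_s)


# Usage with a canned sequence of statuses instead of a real server.
statuses = iter(["Active", "Rebooting", "Active"])
print(wait_for_final_status("restart", lambda: next(statuses), sleep=lambda s: None))
```

Passing the status source and the sleep function as parameters keeps the helper independent of any particular client and easy to exercise without a real server.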

GPU Server Resource Management

If you need server control and management functions for the created GPU Server resources, you can perform tasks on the GPU Server Resource List or GPU Server Details page.

Create Image

You can create an image of a running GPU server.

Reference

This document provides instructions on how to create a user custom image from the image of a running GPU server.

On the GPU Server List or GPU Server Details page, click the Create Image button to create a user Custom Image.

To create an Image of the GPU Server, follow the steps below.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource to create an Image from. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Create Image button. You will be taken to the Image creation page.
- Enter the required information in the Service Information Input area.

Category	Required	Detailed description
Image name	Required	Name of the image to be created. Enter up to 200 characters using English letters, numbers, spaces, and special characters (`-`, `_`).

Table. Image service information input items

Check the input information and click the Generate button.
- Once creation is complete, check the created resources on the All Services > Compute > GPU Server > Image List page.

Notice

When an Image is created, it is stored in the Object Storage used as internal storage. Therefore, an Object Storage usage fee is charged for Image storage.
Because the file system integrity of an image created from an active GPU Server cannot be guaranteed, it is recommended to stop the server before creating the image.

Enable detailed monitoring for ServiceWatch

By default, the GPU Server is linked to the basic monitoring of ServiceWatch in the Virtual Server namespace. You can enable detailed monitoring as needed to identify operational issues and take action more quickly. For detailed information about ServiceWatch, see the ServiceWatch Overview.

Reference

GPU Server provides basic and detailed monitoring in the same namespace as Virtual Server. GPU Server’s GPU metrics will be provided by the ServiceWatch Agent. (Scheduled for December 2025)

Caution

Basic monitoring is provided for free, but enabling detailed monitoring incurs additional charges. Please note the usage.

To enable detailed ServiceWatch monitoring on the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource on which to enable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Edit button for ServiceWatch detailed monitoring. The ServiceWatch Detailed Monitoring Edit popup window opens.
In the ServiceWatch Detailed Monitoring Edit popup window, select Enable, review the guidance message, and click the Confirm button.
Check the ServiceWatch detailed monitoring item on the GPU Server Details page.

Disable detailed monitoring of ServiceWatch

Caution

Detailed monitoring incurs additional charges, so disable it when it is not needed. Keep detailed monitoring enabled only when it is absolutely necessary.

To disable detailed monitoring of ServiceWatch on the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource on which to disable ServiceWatch detailed monitoring. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Edit button for ServiceWatch detailed monitoring. The ServiceWatch Detailed Monitoring Edit popup window opens.
In the ServiceWatch Detailed Monitoring Edit popup window, deselect Enable, review the guidance message, and click the Confirm button.
Check the ServiceWatch detailed monitoring item on the GPU Server Details page.

GPU Server management additional features

For managing the GPU Server, you can view Console logs, generate Dumps, and perform Rebuilds. To view the Console logs, create a Dump, and perform a Rebuild on the GPU Server, follow these steps.

Check console log

You can view the current console log of the GPU Server.

To check the console log of the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource whose console log you want to view. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Console Log button. The Console Log popup window opens.
Check the console log displayed in the Console Log popup window.

Create Dump

To create a dump file on the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource for which to create a dump. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Generate Dump button.
- The dump file is created inside the GPU server.

Execute Rebuild

You can delete all data and settings of the existing GPU Server and rebuild it on a new server.

To perform a Rebuild of the GPU Server, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, click the resource to rebuild. You will be taken to the GPU Server Details page.
On the GPU Server Details page, click the Rebuild button.
- During a GPU Server Rebuild, the server status changes to Rebuilding, and when the Rebuild is complete, it returns to the state it was in before the Rebuild.
- For detailed information about the GPU Server status, see Check GPU Server detailed information.

Terminate GPU Server

If you terminate an unused GPU server, you can reduce operating costs. However, if you terminate the GPU server, the running service may be stopped immediately, so you should proceed with the termination only after fully considering the impact of service interruption.

Caution

Please be aware that data cannot be recovered after terminating the service.

To terminate the GPU Server, follow the steps below.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the GPU Server menu. You will be taken to the GPU Server List page.
On the GPU Server List page, select the resource to terminate, and click the Cancel Service button.
- Whether attached storage is deleted depends on the Delete on termination setting; see Termination constraints.
When termination is complete, check on the GPU Server List page that the resource has been terminated.

Termination constraints

When a GPU Server termination request cannot be processed, a popup explains the reason. Refer to the cases below.

Termination not possible

When an LB server group is attached: terminate the attached LB server group first.
When Lock is set: change the Lock setting to disabled and try again.

Deletion by Delete on termination setting

Whether an attached volume is deleted depends on its Delete on termination setting.
- When Delete on termination is not set: the volume is not deleted even if you terminate the GPU Server.
- When Delete on termination is enabled: terminating the GPU Server deletes the attached volume.
Volumes that have a snapshot will not be deleted even if Delete on termination is set.
A multi-attach volume can be deleted only when the server being deleted is the last remaining server attached to the volume.
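
The volume deletion rules above combine three conditions, which can be written as a single decision function. This is a minimal sketch for reasoning about the rules. The function and parameter names are hypothetical and do not correspond to any product API.

```python
def volume_deleted_on_termination(
    delete_on_termination: bool,
    has_snapshot: bool,
    multi_attach: bool = False,
    other_attached_servers: int = 0,
) -> bool:
    """Return True if the volume is deleted when its GPU Server is terminated."""
    if not delete_on_termination:
        return False  # setting not enabled: the volume is kept
    if has_snapshot:
        return False  # volumes that have a snapshot are never deleted
    if multi_attach and other_attached_servers > 0:
        return False  # another server is still attached to the volume
    return True


print(volume_deleted_on_termination(True, has_snapshot=False))
print(volume_deleted_on_termination(True, has_snapshot=True))
print(volume_deleted_on_termination(True, False, multi_attach=True, other_attached_servers=1))
```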

2.1 - Manage Image

Using the Samsung Cloud Platform Console, you can create an Image within the GPU Server service by entering the required information and selecting detailed options.

Create Image

You can create an image of a running GPU Server. To create an image of a GPU Server, see Create Image.

Check Image detailed information

The Image service allows you to view and edit the full resource list and detailed information. The Image Details page consists of Details, Tags, Activity Log tabs.

To view detailed information of the Image service, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Image menu. You will be taken to the Image list page.

On the Image List page, click the resource to view detailed information. You will be taken to the Image Detail page.

The Image Details page displays status information and additional feature information, and consists of Details, Tags, Activity Log tabs.

Category	Detailed description
Image status	Status of the user-created Image. Active: available. Queued: the Image has been uploaded and is waiting to be processed. Importing: the Image has been uploaded and is being processed.
Share with another account	The Image can be shared with another Account. The Image's Visibility must be set to Shared in order to share it with another Account.
Delete image	Button to delete the Image. Once the Image is deleted, it cannot be restored.

Table. GPU Server Image status information and additional features

Detailed Information

On the Image List page, you can view detailed information of the selected resource and modify it if necessary.

Category	Detailed description
Service	Service name
Resource Type	Resource Type
SRN	Unique resource ID in Samsung Cloud Platform. Here it refers to the SRN of the GPU Server Image.
Resource name	Image name
Resource ID	Image ID
Creator	User who created the Image
Creation date and time	Image creation date and time
Modifier	User who edited the Image
Modification date and time	Image modification date and time
Image name	Image name
Minimum disk	Minimum disk capacity (GB) of the Image. To modify the minimum disk, click the Edit button.
Minimum RAM	Minimum RAM size (GB) of the Image
OS type	OS type of the image
OS hash algorithm	OS hash algorithm method
Visibility	Access permission of the image. Private images can be used only within the project; Shared images can be shared across projects.
Protected	Whether deletion of the image is prohibited. Checking Use prevents accidental deletion of the image. This setting can be changed after the image is created.
Image file URL	URL of the image file uploaded when the image was created. It is not displayed for images created through the Create Image menu on the GPU Server Details page.
Sharing status	Current status of sharing the image with other Accounts. Approved Account ID: ID of the Account for which sharing is approved. Modification date and time: date and time when sharing was requested to the other Account; when the sharing status changes from Pending to Accepted, it is updated to the date and time of that change. Status: approval status. Accepted: sharing is approved and active. Pending: awaiting approval. Stop sharing: stops sharing. Select the Account IDs to stop sharing with from the list, then click the Stop sharing button at the top of the list to stop sharing with all of them at once.

Table. Image detailed information tab items

Category	Detailed description
Tag list	Tag list. You can view the Key and Value information of each tag. Up to 50 tags can be added per resource. When entering a tag, you can search and select from the list of previously created Keys and Values.

Activity Log

You can view the operation history of the selected resource on the Image List page.

Category	Detailed description
Activity log list	Resource change history. You can check the operation time, resource ID, resource name, operation details, event topic, operation result, and operator information.

Table. GPU Server Image Activity Log tab items

Image Resource Management

This section describes the control and management functions for a created Image.

To share an Image with another Account, follow the steps below.

Log in to the Account that will share the Image and click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Image menu. You will be taken to the Image List page.
On the Image List page, click the Image you want to share. You will be taken to the Image Details page.
Click the Share to another Account button. You will be taken to the Share image to another Account page.
- The Share to another Account function lets you share an Image with another Account. To be shared, the Image's Visibility must be Shared.

On the Share image to another Account page, enter the required information, and click the Done button.

Category	Required	Detailed description
Image name	-	Name of the image to share. Cannot be edited.
Image ID	-	ID of the image to share. Cannot be edited.
Shared Account ID	Required	Enter the ID of the other Account to share with. Enter up to 64 characters using English letters, numbers, and the special character `-`.

Table. Required input fields for sharing images to another Account

You can check the result in the Sharing status item on the Image Details page.
- When the request is first made, the status is Pending; it changes to Accepted once the Account receiving the share completes approval.
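
The Shared Account ID format from the input table can be checked before submitting the form. The sketch below is illustrative only: the function name is hypothetical, and the rule it encodes (English letters, numbers, and `-`, within 64 characters) is taken from the table above.

```python
import re

# English letters, numbers, and '-', 1 to 64 characters (from the input table).
_ACCOUNT_ID = re.compile(r"[A-Za-z0-9-]{1,64}")


def is_valid_shared_account_id(account_id: str) -> bool:
    """Return True if the value matches the documented Shared Account ID format."""
    return _ACCOUNT_ID.fullmatch(account_id) is not None


print(is_valid_shared_account_id("acct-1234"))   # hypothetical ID
print(is_valid_shared_account_id("acct_1234"))   # '_' is not in the allowed set
```

This only checks the format. Whether the Account exists, and whether it accepts the share, is reflected in the Pending and Accepted statuses described above.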

Notice

Only images that the current user created by uploading an image file can be shared with another Account. A Custom Image created from the image of a running GPU Server cannot currently be shared with another Account; this feature is planned for a future release.

To receive an Image shared from another Account, follow these steps.

Log in to the Account that will receive the shared Image and click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Image menu. You will be taken to the Image List page.
On the Image List page, click the Receive Image Share button. The Receive Image Share popup window opens.
In the Receive Image Share popup window, enter the resource ID of the Image you want to receive, and click the Confirm button.
When image sharing is complete, you can view the shared Image in the Image list.

Delete Image

You can delete unused Images. However, since an Image cannot be recovered after deletion, you should carefully consider the impact before performing the deletion.

Caution

Please note that an Image cannot be recovered after it is deleted.

To delete the Image, follow these steps.

Click the All Services > Compute > GPU Server menu. Go to the GPU Server’s Service Home page.
On the Service Home page, click the Image menu. You will be taken to the Image List page.
On the Image List page, select the resource to delete, and click the Delete button.
- To delete several Images at once, select their check boxes on the Image List page and click the Delete button at the top of the resource list.
After deletion is complete, verify on the Image List page that the resource has been removed.

2.2 - Manage Keypair

Using the Samsung Cloud Platform Console, you can create a Keypair within the GPU Server service by entering the required information and selecting detailed options.

Create a Keypair

You can create and use the Keypair service while using the GPU Server service in the Samsung Cloud Platform Console.

To create a keypair, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.

On the Keypair List page, click the Keypair Create button. You will be taken to the Keypair Create page.

Enter the required information in the Service Information Input area.

Category	Required	Detailed description
Keypair name	Required	Enter the name of the Keypair to create using English letters, numbers, spaces, and special characters (`-`, `_`) within 255 characters
Keypair type	Required	ssh

Table. Keypair service information input fields

In the Additional Information Input area, enter or select the required information.

Category	Required	Detailed description
Tag	Optional	Add tags. Up to 50 can be added per resource. After clicking the Add Tag button, enter or select the Key and Value.

Table. Keypair additional information input fields

Caution
- After creation is complete, you can download the Key only once. Since it cannot be reissued, make sure that it has been downloaded.
- Store the downloaded Private Key in a safe place.

Check the input information and click the Create button.
- After creation is complete, check the created resources on the Keypair List page.
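
Once the Private Key has been downloaded, a common way to keep it safe on Linux or macOS is to restrict the file to its owner before using it with `ssh`. The sketch below is one possible approach, not a product requirement: the helper names, the key path, and the host address are hypothetical, and the login accounts are the defaults listed for GPU Server creation (RHEL: cloud-user, Ubuntu: ubuntu).

```python
import os
import stat
from pathlib import Path

# Default login accounts by OS, as listed in the GPU Server creation guide.
DEFAULT_LOGIN = {"RHEL": "cloud-user", "Ubuntu": "ubuntu"}


def protect_private_key(path: Path) -> None:
    """Restrict the key file to owner read/write (mode 0600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def ssh_command(key_path: Path, os_name: str, host: str) -> list[str]:
    """Build an ssh command line that authenticates with the private key."""
    return ["ssh", "-i", str(key_path), f"{DEFAULT_LOGIN[os_name]}@{host}"]


# Hypothetical key path and documentation-range IP address.
print(" ".join(ssh_command(Path("my-keypair.pem"), "Ubuntu", "192.0.2.10")))
```

OpenSSH typically refuses private key files that other users can read, so restricting the mode first avoids a failed first login.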

View detailed information of the Keypair

You can view the full list of Keypair resources and view or edit the detailed information of each one. The Keypair Details page consists of the Details, Tags, and Activity Log tabs.

To view detailed information about a keypair, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
On the Keypair List page, click the resource to view its details. You will be taken to the Keypair Details page.
- The Keypair Details page displays status information and additional feature information, and consists of the Details, Tags, and Activity Log tabs.

Detailed Information

On the Keypair List page, you can view the detailed information of the selected resource and edit it if needed.

Category	Detailed description
Service	Service name
Resource Type	Resource Type
SRN	Unique resource ID in Samsung Cloud Platform. For a Keypair, this is the Keypair SRN
Resource name	Keypair name
Resource ID	Keypair’s unique resource ID
Creator	User who created the keypair
Creation date and time	Keypair creation timestamp
Editor	User who modified the keypair information
Modification date	Date and time the keypair information was modified
Keypair name	Keypair name
Fingerprint	A unique value for identifying the key
User ID	User ID of the keypair creator
Public key	Public key information

Table. Keypair detailed information tab items

Category	Detailed description
Tag list	List of tags. You can view the Key and Value of each tag. Up to 50 tags can be added per resource. When entering a tag, you can search and select from previously created Keys and Values

Activity Log

On the Keypair List page, you can view the activity log of the selected resource.

Category	Detailed description
Activity log list	Resource change history. Check the operation time, resource ID, resource name, operation details, event topic, operation result, and operator information

Table. Keypair Activity Log tab detailed information items

Keypair Resource Management

This section describes the control and management functions for a Keypair.

Import public key

To import a public key, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.

On the Keypair List page, click the More button at the top and then click the Import Public Key button. You will be taken to the Import Public Key page.

Enter or select the required information in the Required Information Input area.

Category	Required	Detailed description
Keypair name	Required	Name of the Keypair to create
Keypair type	Required	ssh
Public key	Required	Provide the public key in one of two ways. Load file: click the Attach file button to attach the public key file (only files with the .pem extension can be attached). Enter public key: paste the copied public key value (the value can be copied from the Keypair Details page)

Table. Required input fields for importing the public key

Review the entered information and click the Complete button.
- Once creation is complete, check the created resources on the Keypair List page.
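
If you only have a private key file, the matching public key can be derived locally before importing it. The following is a minimal sketch, assuming bash and OpenSSH's `ssh-keygen`; the file names and the helpers `has_pem_extension` and `export_public_key` are hypothetical. Whether the Console accepts the one-line OpenSSH public key format inside a `.pem` file is an assumption to verify before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the file name ends in .pem,
# the only extension the import form lists as attachable.
has_pem_extension() {
  case "$1" in
    *.pem) return 0 ;;
    *)     return 1 ;;
  esac
}

# Hypothetical helper: derives the public key from a private key file.
# "ssh-keygen -y -f FILE" reads a private key and prints its public key.
export_public_key() {
  local private_key="$1" public_key="$2"
  has_pem_extension "$public_key" || { echo "use a .pem file name" >&2; return 1; }
  ssh-keygen -y -f "$private_key" > "$public_key"
}

# Usage (placeholder paths, not run here):
#   export_public_key ./my-keypair.pem ./my-keypair-pub.pem
```

The extension check runs before `ssh-keygen` so that no output file is created under a name the form would reject.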

Delete Keypair

You can delete unused Keypairs. However, a deleted Keypair cannot be recovered, so review the impact thoroughly before proceeding.

Caution

Please note that a Keypair cannot be recovered after it is deleted.

To delete a keypair, follow these steps.

Click the All Services > Compute > GPU Server menu. You will be taken to the Service Home page of GPU Server.
On the Service Home page, click the Keypair menu. You will be taken to the Keypair List page.
On the Keypair List page, select the resource to delete, and click the Delete button.

- To delete multiple Keypairs at once, select their check boxes on the Keypair List page and click the Delete button at the top of the resource list.

After deletion is complete, check the Keypair List page to confirm that the resource has been removed.

2.3 - Use Multi-instance GPU on GPU Server

After creating a GPU Server, you can enable the MIG (Multi-instance GPU) feature on the GPU Server’s VM (Guest OS) and create an instance for use.

NVIDIA Multi-instance GPU Introduction

NVIDIA Multi-instance GPU (hereafter MIG), available starting with the NVIDIA Ampere architecture, securely partitions a single GPU into multiple GPU instances for running CUDA applications. This lets multiple users each work with separate GPU resources, which improves overall GPU utilization. The feature is especially useful for workloads that do not fully use the GPU’s compute capacity, because several such workloads can run in parallel.

Using Multi-instance GPU feature

To use the MIG feature, create an NVIDIA GPU Server on the Samsung Cloud Platform and enable MIG on it; disable MIG when it is no longer needed. The order for applying and releasing MIG is as follows.

MIG application order

Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG

MIG release order

Delete Compute Instance → Delete GPU Instance → Disable MIG feature (deactivate)
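
The two orders above can be sketched as a dry run that only prints the commands for each phase instead of executing them. This is a minimal sketch, assuming bash; the function names and argument order are hypothetical, and the GPU Instance ID is passed in by hand even though the driver assigns it when the GPU Instance is created. The sketch does not include the VM reboot that follows enabling MIG.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints nvidia-smi commands, does not run them.
# Function and argument names are hypothetical.

# mig_apply_plan GPU_ID GI_PROFILE_ID GI_ID CI_PROFILE_ID
mig_apply_plan() {
  local gpu="$1" gi_profile="$2" gi="$3" ci_profile="$4"
  echo "nvidia-smi -i ${gpu} -mig 1"                            # Enable MIG
  echo "nvidia-smi mig -i ${gpu} -cgi ${gi_profile}"            # Create GPU Instance
  echo "nvidia-smi mig -i ${gpu} -gi ${gi} -cci ${ci_profile}"  # Create Compute Instance
}

# mig_release_plan GPU_ID GI_ID
mig_release_plan() {
  local gpu="$1" gi="$2"
  echo "nvidia-smi mig -i ${gpu} -gi ${gi} -dci"  # Delete Compute Instance
  echo "nvidia-smi mig -i ${gpu} -gi ${gi} -dgi"  # Delete GPU Instance
  echo "nvidia-smi -i ${gpu} -mig 0"              # Disable MIG
}

# Example: GPU 0, GPU Instance profile 0, GPU Instance 0, Compute profile 4.
mig_apply_plan 0 0 0 4
mig_release_plan 0 0
```

Printing the commands first makes it easy to confirm that the release phase mirrors the apply phase in reverse before anything is changed on the GPU.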

Reference

MIG can be used on Samsung Cloud Platform’s next-generation GPU Server or MNGC (Multi-node GPU Cluster).
For system requirements to use MIG, refer to the NVIDIA Multi-Instance GPU User Guide.

Applying and Using MIG

Enabling MIG, creating instances, and assigning tasks to them proceed in the following order.

MIG application order

Enable MIG → Create GPU Instance → Create Compute Instance → Use MIG

Note

The example of applying MIG is explained based on an A100 GPU server.

Activate MIG

Check the GPU status on the VM Instance (GuestOS) before applying MIG.

Check whether MIG mode is Disabled.

$ nvidia-smi
Mon Sep 27 08:37:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                    0 |
| N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code block. nvidia-smi command - Check GPU disabled status (1)

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)

Code block. nvidia-smi command - Check GPU disabled status (2)

Enable MIG for each GPU on the VM Instance (GuestOS), then reboot the VM Instance.

$ nvidia-smi -i 0 -mig 1
Enabled MIG mode for GPU 00000000:05:00.0
All done.

# reboot

Code block. nvidia-smi command - enable MIG

Reference

If the GPU is in use when you configure MIG, the following warning message may appear. If it does, check whether any programs are running on the GPU.

Warning: MIG mode is in pending enable state for GPU 00000000:05:00.0: In use by another client. 00000000:05:00.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi).

Check the GPU status after applying MIG on the VM Instance(GuestOS).

Check whether MIG mode is Enabled.

$ nvidia-smi
Mon Sep 27 09:44:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   32C    P0    59W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+-----------------------------------------------------------------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code block. nvidia-smi command - Check GPU activation status (1)

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)

Code block. nvidia-smi command - Check GPU activation status (2)
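
The MIG mode check can also be scripted rather than read from the table by eye. The following is a minimal sketch, assuming bash and assuming that the `mig.mode.current` field of `nvidia-smi --query-gpu` is available in your driver version; the helper `all_mig_enabled` is hypothetical and reads one mode value per line so that it can be tried without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one MIG mode per line, as printed by
#   nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
# and succeeds only if at least one line is read and every line is "Enabled".
all_mig_enabled() {
  local mode count=0
  while IFS= read -r mode; do
    count=$((count + 1))
    [ "$mode" = "Enabled" ] || return 1
  done
  [ "$count" -gt 0 ]
}

# Example with canned input; on a GPU Server, pipe the nvidia-smi query instead.
if printf 'Enabled\n' | all_mig_enabled; then
  echo "MIG is enabled on all GPUs"
fi
```

Treating empty input as a failure avoids reporting success when the query itself returned nothing.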

GPU Instance creation

If you have enabled MIG and verified its status, you can create a GPU Instance.

Check the list of MIG GPU Instance profiles that can be created.

$ nvidia-smi mig -i [GPU ID] -lgip

Code block. nvidia-smi command - view MIG GPU Instance profile list

$ nvidia-smi mig -i 0 -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.10gb       19     7/7        9.50       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb+me    20     1/1        9.50       No     14     0     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.20gb       14     3/3        19.50      No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.40gb        9     2/2        39.50      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.40gb        5     1/1        39.50      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.80gb        0     1/1        79.25      No     98     0     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+

Code block. MIG GPU Instance profile list
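
When choosing a profile in a script, the profile ID and name can be pulled out of the listing. The following is a minimal sketch, assuming bash and awk; the helper `list_gi_profiles` is hypothetical and relies only on the token order seen in the listing (the word MIG, then the profile name, then the profile ID), which may differ in other driver versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "nvidia-smi mig -lgip" output on stdin and prints
# "<profile ID> <profile name>" for each row that contains the token MIG.
list_gi_profiles() {
  awk '{
    for (i = 1; i + 2 <= NF; i++) {
      if ($i == "MIG") { print $(i + 2), $(i + 1); break }
    }
  }'
}

# Example with two rows taken from the profile listing.
list_gi_profiles <<'EOF'
|   0  MIG 1g.10gb       19     7/7        9.50       No     14     0     0   |
|   0  MIG 7g.80gb        0     1/1        79.25      No     98     0     0   |
EOF
```

On a GPU Server the same function could be fed directly, for example `nvidia-smi mig -i 0 -lgip | list_gi_profiles`.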

Reference

Refer to the NVIDIA Multi-Instance GPU User Guide for GPU Instance profiles.

Create a MIG GPU Instance and verify it.

Create GPU Instance

$ nvidia-smi mig -i [GPU ID] -cgi [Profile ID]

Code block. nvidia-smi command - GPU Instance creation

$ nvidia-smi mig -i 0 -cgi 0
Successfully created GPU instance ID 0 on GPU 0 using profile MIG 7g.80gb (ID 0)

Code block. nvidia-smi command - Example of creating a GPU Instance

Check GPU Instance

$ nvidia-smi mig -i [GPU ID] -lgi

Code block. nvidia-smi command - check GPU Instance

$ nvidia-smi mig -i 0 -lgi
+--------------------------------------------------------+
| GPU instances:                                         |
| GPU   Name               Profile  Instance  Placement  |
|                            ID       ID      Start:Size |
|========================================================|
|   0  MIG 7g.80gb            0        0         0:8     |
+--------------------------------------------------------+

Code block. nvidia-smi command - example of checking GPU Instance

Compute Instance creation

If you have created a GPU Instance, you can create a Compute Instance.

Check the MIG Compute Instance profiles you can create.

$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip

Code block. nvidia-smi command - Check MIG Compute Instance profile

$ nvidia-smi mig -i 0 -gi 0 -lcip
+---------------------------------------------------------------------------------+
| Compute instance profiles:                                                      |
| GPU     GPU     Name            Profile  Instances   Exclusive      Shared      |
| GPU   Instance                     ID    Free/Total     SM       DEC  ENC  OFA  |
|         ID                                                       CE   JPEG      |
|=================================================================================|
|   0      0      MIG 1c.7g.80gb     0      7/7           14       5    0    1    |
|                                                                  7    1         |
+---------------------------------------------------------------------------------+
|   0      0      MIG 2c.7g.80gb     1      3/3           28       5    0    1    |
|                                                                  7    1         |
+---------------------------------------------------------------------------------+
|   0      0      MIG 3c.7g.80gb     2      2/2           42       5    0    1    |
|                                                                  7    1         |
+---------------------------------------------------------------------------------+
|   0      0      MIG 4c.7g.80gb     3      1/1           56       5    0    1    |
|                                                                  7    1         |
+---------------------------------------------------------------------------------+
|   0      0      MIG 7g.80gb        4*     1/1           98       5    0    1    |
|                                                                  7    1         |
+---------------------------------------------------------------------------------+

Code block. Example of MIG Compute Instance profile list

Create and verify a MIG Compute Instance.

MIG Compute Instance creation

$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -cci [Compute Profile ID]

Code block. nvidia-smi command - Create MIG Compute Instance

$ nvidia-smi mig -i 0 -gi 0 -cci 4
Successfully created compute instance ID 0 on GPU instance ID 0 using profile MIG 7g.80gb (ID 4)

Code block. nvidia-smi command - Example of creating a MIG Compute Instance

Check MIG Compute Instance

$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lci

Code block. nvidia-smi command - check MIG Compute Instance

$ nvidia-smi mig -i 0 -gi 0 -lci
+-----------------------------------------------------------------+
| Compute instance profiles:                                      |
| GPU     GPU     Name            Profile  Instances   Placement  |
| GPU   Instance                     ID      ID        Start:Size |
|         ID                                                      |
|=================================================================|
|   0      0      MIG 7g.80gb         4       0            0:7    |
+-----------------------------------------------------------------+

Code block. Example of checking MIG Compute Instance

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
  MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)

Code block. nvidia-smi command - Check GPU status (1)

$ nvidia-smi
Mon Sep 27 09:52:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   32C    P0    49W / 400W |      0MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+-----------------------------------------------------------------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |      0MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
|                  |      1MiB / 13107... |           |                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Code block. nvidia-smi command - Check GPU status (2)
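
The MIG device UUID reported by `nvidia-smi -L` can be extracted for use in scripts. The following is a minimal sketch, assuming bash and sed; the helper `list_mig_uuids` is hypothetical and matches only the `(UUID: MIG-...)` pattern seen in the device listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "nvidia-smi -L" output on stdin and prints the
# UUID of every MIG device line (UUIDs that start with "MIG-").
list_mig_uuids() {
  sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Example with the device listing for the A100 GPU Server.
list_mig_uuids <<'EOF'
GPU 0: NVIDIA A100-SXM-80GB (UUID: GPU-c956838f-494a-92b2-6818-56eb28fe25e0)
  MIG 7g.80gb     Device  0: (UUID: MIG-53e20040-758b-5ecb-948e-c626d03a9a32)
EOF
```

A UUID obtained this way might then be handed to a CUDA process, for example through the `CUDA_VISIBLE_DEVICES` environment variable; whether your driver version accepts this UUID format there is an assumption to check against the NVIDIA Multi-Instance GPU User Guide.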

Using MIG

Run a job on the MIG Instance.

Example of task execution

$ docker run --gpus '"device=[GPU ID]:[MIG ID]"' --rm nvcr.io/nvidia/cuda nvidia-smi

Code block. Task execution example

The following is an example of running a job.

$ docker run --gpus '"device=0:0"' --rm -it --network=host --shm-size=1g --ipc=host -v /root/.ssh/:/root/.ssh

================
== TensorFlow ==
================

NVIDIA Release 21.08-tf1 (build 26012104)
TensorFlow Version 1.15.5

Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
...

# Run Python process
root@d622a93c9281:/workspace# python /workspace/nvidia-examples/cnn/resnet.py --num_iter 100
...
PY 3.8.10 (default, Jun 2 2021, 10:49:15)
[GCC 9.4.0]
TF 1.15.5
...

Code block. Operation result

Check the GPU usage after the job process has been created.

When the job runs, you can see that a process is allocated to the MIG device and its utilization increases.

$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -lcip

Code block. nvidia-smi command - check GPU utilization

You can check the GPU usage as shown below.

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+-----------------------------------------------------------------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |  66562MiB / 81251MiB | 98      0 |  7   0    5    1    1 |
|                  |      5MiB / 13107... |           |                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    0    0      17483      C   python                          66559MiB |
+-----------------------------------------------------------------------------+

Code block. Example of checking GPU utilization.
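
To turn the Memory-Usage cell into a single number for logging or alerting, the used and total values can be converted to a percentage. The following is a minimal sketch, assuming bash; the helper `mem_usage_percent` is hypothetical and expects a cell in the `<used>MiB / <total>MiB` form.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turns a Memory-Usage cell such as
# "66562MiB / 81251MiB" into an integer percentage (rounded down).
mem_usage_percent() {
  local cell="$1" used total
  used="${cell%%MiB*}"    # text before the first "MiB"
  total="${cell##*/ }"    # text after "/ "
  total="${total%MiB}"    # drop the trailing "MiB"
  echo $(( used * 100 / total ))
}

# Example with the Memory-Usage value of the running job.
mem_usage_percent "66562MiB / 81251MiB"
```

Integer arithmetic is used because bash has no built-in floating point; for finer resolution the same values could be passed to a tool such as awk.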

Delete and release MIG Instance

Follow these steps to delete the MIG instances and disable MIG.

MIG release order

Delete Compute Instance → Delete GPU Instance → Disable MIG feature (deactivate)

Delete Compute Instance

Delete the Compute Instance.

$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dci
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -ci [Compute Instance ID] -dci

Code block. nvidia-smi command - Delete Compute Instance

Color mode

$ nvidia-smi mig -i 0 -gi 0 –lci
+-----------------------------------------------------------------+
| Compute instance profiles: |
| GPU     GPU     Name            Profile  Instances   Placement |
| GPU   Instance                     ID      ID        Start:Size |
| ID |
| ================================================================= |
| 0      0      MIG 7g.80gb         4       0            0:7 |
+-----------------------------------------------------------------+

Code block. Example of checking MIG Compute Instance

$ nvidia-smi mig -i 0 -gi 0 -dci
Successfully destroyed compute instance ID  0 from GPU instance ID  0

Code block. Compute Instance deletion example

$ nvidia-smi mig -i 0 -gi 0 -lci
No compute instances found: Not found

Code block. Confirm Compute Instance deletion

Delete GPU Instance

Delete the GPU Instance.

$ nvidia-smi mig -i [GPU ID] -dgi
$ nvidia-smi mig -i [GPU ID] -gi [GPU Instance ID] -dgi

Code block. nvidia-smi command - Delete GPU Instance

$ nvidia-smi mig -i 0 -lgi
+--------------------------------------------------------+
| GPU instances: |
| GPU   Name               Profile  Instance  Placement |
| ID       ID      Start:Size |
| ======================================================== |
| 0  MIG 7g.80gb            0        0         0:8 |
+--------------------------------------------------------+

Code block. nvidia-smi command - Example of checking GPU Instance

$ nvidia-smi mig -i 0 -dgi
Successfully destroyed GPU instance ID  0 from GPU  0

Code block. nvidia-smi command - GPU Instance deletion example

$ nvidia-smi mig -i 0 -lgi
No GPU instances found: Not found

Code block. nvidia-smi command - Confirm GPU Instance deletion

Disable MIG feature (deactivation)

Disable MIG, then reboot the server.

$ nvidia-smi -mig 0
Disabled MIG Mode for GPU 00000000:05:00.0

All done.

Code block. nvidia-smi command - disable MIG

$ nvidia-smi
Mon Sep 30 05:18:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
| -------------------------------+----------------------+---------------------- |
| GPU  Name        Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap | Memory-Usage | GPU-Util  Compute M. |
|  |  | MIG M. |
| ===============================+======================+====================== |
| 0  NVIDIA A100-SXM...  Off | 00000000:05:00.0 Off | 0 |
| N/A   33C   P0    60W / 400W | 0MiB / 81251MiB | 0%      Default |
|  |  | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+-----------------------------------------------------------------------------+
| GPU  GI  CI  MIG | Memory-Usage | Vol | Shared |
| ID  ID  Dev | BAR1-Usage | SM     Unc | CE  ENC  DEC  OFA  JPG |
|  |  | ECC |  |
| ============================================================================= |
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU   GI   CI       PID   Type   Process name                   GPU Memory |
| ID   ID                                                   Usage |
| ============================================================================= |
| No running processes found |
+-----------------------------------------------------------------------------+

Code block. nvidia-smi command - check GPU status

2.4 - Use NVSwitch on GPU Server

After creating a GPU Server, you can enable the NVSwitch feature on the GPU Server’s VM (Guest OS) and use fast GPU-to-GPU P2P communication.

Caution

On Samsung Cloud Platform, only the GPU Server (8 GPU) type and the Multi-node GPU Cluster are connected with NVSwitch and NVLink.

Exploring NVIDIA NVSwitch for Multi GPU

NVLink is a bidirectional, direct GPU-to-GPU interconnect that expands I/O between multiple GPUs within a server. With NVSwitch, all GPUs in a server can be connected to each other at full NVLink bandwidth.

Checking NVSwitch operation

Check the NVIDIA Fabric Manager, NVIDIA NVLink topology, and NVIDIA NVLink Status on the GPU server.

Reference

The examples for checking NVSwitch operation use the A100 GPU Server (g1v128a8) as a reference.

NVIDIA Fabric Manager operating status

Verify that the service status is displayed as active (running), which indicates normal operation.

~$ systemctl status nvidia-fabricmanager

nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2026-02-02 16:23:27 KST; 32min ago
   Main PID: 2191 (nv-fabricmanage)
      Tasks: 18 (limit: 629145)
     Memory: 18.0M
        CPU: 33.461s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─2191 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Code block. Check NVIDIA Fabric Manager status
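
For scripted checks, the status text can be tested instead of read by eye. The following is a minimal sketch under stated assumptions: the function name fabricmanager_is_running is hypothetical, and it relies only on the "Active: active (running)" line shown in the example output.

```shell
# Succeeds (exit status 0) if the systemctl status text on stdin
# contains the "Active: active (running)" line.
fabricmanager_is_running() {
  grep -q 'Active: active (running)'
}

# Usage on a GPU Server (assumes systemd and the nvidia-fabricmanager unit):
#   systemctl status nvidia-fabricmanager | fabricmanager_is_running && echo "Fabric Manager OK"
# A shorter alternative that skips the text parsing:
#   systemctl is-active --quiet nvidia-fabricmanager && echo "Fabric Manager OK"
```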

Check NVIDIA NVLink topology

Check the NVIDIA NVLink topology.

~$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-127   0-7             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    0-127   0-7             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    0-127   0-7             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    0-127   0-7             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      0-127   0-7             N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Code block. Check NVIDIA NVLink topology
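
On a correctly configured 8-GPU server, every GPU-to-GPU cell in the matrix is either X (self) or NV# (NVLink). A minimal sketch of an automated check follows; the function name all_gpus_nvlinked is hypothetical, and it assumes the matrix layout shown above, where the GPU columns come first in the header row.

```shell
# Reads "nvidia-smi topo -m" output on stdin.
# Prints every GPU pair that is not connected through NVLink and
# exits with status 1 if any such pair exists, 0 otherwise.
all_gpus_nvlinked() {
  awk '
    # Header row: both the first and the second field are GPU names.
    $1 ~ /^GPU[0-9]+$/ && $2 ~ /^GPU[0-9]+$/ {
      n = 0
      for (i = 1; i <= NF; i++) if ($i ~ /^GPU[0-9]+$/) col[++n] = $i
      next
    }
    # Matrix row: fields 2..n+1 hold the link type to each GPU column.
    n > 0 && $1 ~ /^GPU[0-9]+$/ {
      for (i = 1; i <= n; i++) {
        link = $(i + 1)
        if (link != "X" && link !~ /^NV[0-9]+$/) {
          print $1 " -> " col[i] ": " link
          bad = 1
        }
      }
    }
    END { exit bad + 0 }
  '
}

# Usage on a GPU Server:
#   nvidia-smi topo -m | all_gpus_nvlinked && echo "All GPU pairs use NVLink"
```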

Check NVIDIA NVLink Status

Check the NVIDIA NVLink Status.

~$ nvidia-smi nvlink -s

GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-64a2f685-bb12-c4af-105c-0726ece9c8d7)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2269851b-71cd-f6c7-50c5-ba1525cf3ce8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-4c397bbf-95fc-5c29-918a-a429cbe45a7a)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-0e350204-9fb6-2cbe-538e-8f7849658eb8)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-45f0c453-4760-edd4-3af9-25c5ea7473a5)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-38409794-bb34-430e-3c50-90b42cb2bb72)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-3fb478aa-801b-eb64-55c2-0ffc3f2ce404)
         Link 0: 25 GB/s
         Link 1: 25 GB/s
         Link 2: 25 GB/s
         Link 3: 25 GB/s
         Link 4: 25 GB/s
         Link 5: 25 GB/s
         Link 6: 25 GB/s
         Link 7: 25 GB/s
         Link 8: 25 GB/s
         Link 9: 25 GB/s
         Link 10: 25 GB/s
         Link 11: 25 GB/s

Code block. Check NVIDIA NVLink status
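
Because the status listing is long, it can help to condense it to one line per GPU. The following is a minimal sketch; the function name nvlink_summary is hypothetical, and it assumes the "GPU N:" and "Link N: <speed> GB/s" line formats shown in the example output. Links that do not report a speed in GB/s are not counted.

```shell
# Reads NVLink status output on stdin and prints, for each GPU,
# the number of links reporting a speed and the sum of those speeds.
nvlink_summary() {
  awk '
    /^GPU [0-9]+:/ {
      gpu = $2
      sub(/:$/, "", gpu)      # "1:" becomes "1"
      order[++k] = gpu
      next
    }
    /Link [0-9]+: [0-9.]+ GB\/s/ {
      links[gpu]++
      total[gpu] += $3        # third field is the numeric speed
    }
    END {
      for (i = 1; i <= k; i++)
        printf "GPU %s: %d links, %d GB/s total\n", order[i], links[order[i]], total[order[i]]
    }
  '
}

# Usage on a GPU Server:
#   nvidia-smi nvlink -s | nvlink_summary
```

In the example output above, each listed A100 GPU reports 12 links at 25 GB/s, so a GPU with fewer links in the summary would stand out immediately.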

2.5 - Install ServiceWatch Agent

Users can install the ServiceWatch Agent on a GPU server to collect custom metrics and logs.

Reference

Custom metric/log collection via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will also be available in other offerings in the future.

Caution

Metric collection through the ServiceWatch Agent is classified as custom metrics and incurs charges, unlike the default metrics collected from each service; therefore, we recommend removing or disabling any unnecessary metric collection settings.

ServiceWatch Agent

Two main types of agents must be installed on a GPU server to collect custom metrics and logs for ServiceWatch: the Prometheus Exporter and the Open Telemetry Collector.

Category	Detailed description

Prometheus Exporter
- Provides metrics of a specific application or service in a format that Prometheus can scrape.
- To collect OS metrics, use Node Exporter on Linux servers or Windows Exporter on Windows servers, depending on the OS type. On GPU Servers, Node Exporter can be used for OS metrics just as on Virtual Servers; see Virtual Server > ServiceWatch Agent for details.
- To collect GPU metrics, use the DCGM (NVIDIA Data Center GPU Manager) Exporter. This guide explains how to install the DCGM Exporter in the GPU metrics Prometheus Exporter installation sections for Ubuntu and for RHEL.

Open Telemetry Collector
- Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, and so on), and exports it to various backends (e.g., Prometheus, Jaeger, Elasticsearch).
- Exports data via the ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
- For instructions on using the Open Telemetry Collector provided by ServiceWatch, see ServiceWatch > Using ServiceWatch Agent.

Table. Explanation of Prometheus Exporter and Open Telemetry Collector

Caution

If you have configured a Kubernetes Engine on a GPU server, please view the GPU metrics using the metrics provided by the Kubernetes Engine.

If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not operate correctly.

Pre-configuration for using ServiceWatch Agent

To use the ServiceWatch Agent, refer to Pre-configuration for ServiceWatch Agent and prepare the prerequisite configuration.

Install Prometheus Exporter for GPU metrics (for Ubuntu)

Install the Prometheus Exporter for collecting metrics from the GPU server according to the steps below.

Step 1. Verify NVIDIA Driver Installation
Step 2. NVSwitch Configuration and Query (NSCQ) Library Installation
Step 3. NVSwitch Device Monitoring API (NVSDM) Library Installation
Step 4. NVIDIA DCGM Installation

Verify NVIDIA Driver Installation

Check the installed NVIDIA driver.

nvidia-smi --query-gpu driver_version --format csv

Code block. NVIDIA Driver version check command

driver_version
535.183.06
535.183.06

Code block. Example of checking NVIDIA Driver version

NVSwitch Configuration and Query (NSCQ) Library Installation

Reference

The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.

Information

The following installation commands apply to environments with internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.

Install cuda-keyring.

wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb

Code block. cuda-keyring download command

sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

Code block. cuda-keyring installation command

nvidia-smi --query-gpu driver_version --format csv

Code block. NVIDIA Driver version check command

driver_version
535.183.06
...
535.183.06

Code block. NVIDIA Driver version check example

Install libnvidia-nscq.

apt-cache policy libnvidia-nscq-535

Code block. NSCQ library apt-cache command

libnvidia-nscq-535:
  Installed: (none)
  Candidate: 535.247.01-1
  Version table:
     535.247.01-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
...
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     535.216.01-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     535.183.06-1 600  # Install the version that matches the Driver
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     535.183.01-1 600
...
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     535.54.03-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. NSCQ library apt-cache command result

apt install libnvidia-nscq-535=535.183.06-1

Code block. NSCQ library installation command

Information

The library must be installed with the same version as the NVIDIA driver.

Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
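
The version-matching rule can be expressed as a small helper that builds the apt package specification from the driver version. This is a minimal sketch: the function name nscq_package_spec is hypothetical, and the "-1" package revision is an assumption taken from the example above, so confirm the exact version string with apt-cache policy before installing.

```shell
# Build the apt package spec for libnvidia-nscq from a driver version.
# Assumption: the package revision suffix is "-1", as in the example.
nscq_package_spec() {
  driver_version="$1"
  major="${driver_version%%.*}"   # "535.183.06" becomes "535"
  echo "libnvidia-nscq-${major}=${driver_version}-1"
}

nscq_package_spec 535.183.06

# Possible usage on a GPU Server:
#   apt install "$(nscq_package_spec "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)")"
```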

NVSwitch Device Monitoring API (NVSDM) Library Installation

Reference

Blackwell or later generation GPUs require the NVSDM Library. NVIDIA Driver version 560 or lower does not provide the NVSDM Library.

Install the NVSDM library.

apt-cache policy libnvsdm

Code block. NVSDM library apt-cache command

libnvsdm:
  Installed: (none)
  Candidate: 580.105.08-1
  Version table:
     580.105.08-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.95.05-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.82.07-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.65.06-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. NVSDM library apt-cache command result

apt install libnvsdm=580.105.08-1

Code block. Install NVSDM library

NVIDIA DCGM installation (for Ubuntu)

Install the DCGM Exporter according to the steps below.

DCGM(datacenter-gpu-manager) Installation

datacenter-gpu-manager-4-cuda12 is a specific version of NVIDIA's Data Center GPU Manager (DCGM), a package for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates that the tool is built for CUDA 12, and datacenter-gpu-manager-4 refers to DCGM version 4.x. The tool provides features such as GPU status monitoring, diagnostics, alerting, and power/clock management.

Check the CUDA version.

nvidia-smi | grep CUDA

Code block. Check CUDA version

| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |

Code block. Example of CUDA version check result

CUDA_VERSION=12

Code block. CUDA version setting command

Install datacenter-gpu-manager-cuda.

apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Code block. datacenter-gpu-manager-cuda installation command
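
Instead of typing the CUDA major version by hand, it can be derived from the nvidia-smi header line. The following is a minimal sketch; the function name cuda_major_version is hypothetical, and it assumes the "CUDA Version: <major>.<minor>" format shown in the example output.

```shell
# Reads nvidia-smi output on stdin and prints the CUDA major version,
# for example "12" for "CUDA Version: 12.2".
cuda_major_version() {
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Possible usage on a GPU Server:
#   CUDA_VERSION=$(nvidia-smi | cuda_major_version)
#   apt install "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```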

datacenter-gpu-manager-exporter installation

datacenter-gpu-manager-exporter is a tool based on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.

Install datacenter-gpu-manager-exporter.

apt install datacenter-gpu-manager-exporter

Code block. datacenter-gpu-manager-exporter installation command

Check the DCGM Exporter configuration file.

cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Code block. Command to check the datacenter-gpu-manager-exporter configuration file

ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

Code block. Example of checking the datacenter-gpu-manager-exporter configuration file

After installing the DCGM Exporter, review the provided configuration file: remove the leading # to enable a required metric, and add # to disable an unnecessary one.

vi /etc/dcgm-exporter/default-counters.csv
## Example ##

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.

Code block. datacenter-gpu-manager-exporter metric configuration example
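
Since each enabled metric is billed as a custom metric, it can be useful to list exactly which metrics are active after editing the file. The following is a minimal sketch; the function name enabled_dcgm_metrics is hypothetical, and it assumes the comma-separated format shown in the example, where the metric name is the first field and disabled lines start with #.

```shell
# Reads a DCGM Exporter counters file on stdin and prints the names of
# the enabled metrics (lines that are neither blank nor commented out).
enabled_dcgm_metrics() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' \
    | cut -d, -f1 \
    | sed 's/[[:space:]]*$//'
}

# Possible usage on a GPU Server:
#   enabled_dcgm_metrics < /etc/dcgm-exporter/default-counters.csv
```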

Reference

For the metrics that can be collected with the GPU DCGM Exporter and how to configure them, see DCGM Exporter Metrics.

Caution

Since metric collection through the ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default metrics, unnecessary metric collection must be removed or disabled to prevent excessive fees.

Enable and start DCGM service

Enable and start the nvidia-dcgm service.

systemctl enable --now nvidia-dcgm

Code block. Command to enable and start the nvidia-dcgm service

Enable and start the nvidia-dcgm-exporter service.

systemctl enable --now nvidia-dcgm-exporter

Code block. Command to enable and start the nvidia-dcgm-exporter service

Information

After completing the DCGM Exporter setup, install the OpenTelemetry Collector provided by ServiceWatch to finish configuring the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.

Install Prometheus Exporter for GPU metrics (for RHEL)

Install the ServiceWatch Agent according to the steps below to collect metrics from the GPU server.

Verify NVIDIA Driver Installation (for RHEL)

Check the installed NVIDIA driver.

nvidia-smi --query-gpu driver_version --format csv

Code block. NVIDIA Driver version check command

driver_version
535.183.06
...
535.183.06

Code block. Example of checking NVIDIA Driver version

NVSwitch Configuration and Query (NSCQ) Library Installation (for RHEL)

Reference

NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.

For RHEL, verify that libnvdia-nscq is installed and install it if necessary.

guide

The following installation commands are applicable in environments with internet access. If you are in an environment without internet access, you need to download libnvdia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it.

Check the libnvidia-nscq package.

rpm -qa | grep libnvidia-nscq
libnvidia-nscq-535-535.183.06-1.x86_64

Code block. Check NSCQ library package

Add the CUDA Repository to DNF.

dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository

Reset the state of the NVIDIA Driver DNF module.

dnf module reset nvidia-driver

Code block. Initialize the state of the NVIDIA Driver DNF module

Updating Subscription Management repositories.
Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
Dependencies resolved.
=============================================
Package Architecture Version Repository Size
=============================================
Disabling module profiles:
nvidia-driver/default
nvidia-driver/fm
Resetting modules:
nvidia-driver

Transaction Summary
=============================================

Is this ok [y/N]: y

Code block. Example of the status reset result of the NVIDIA Driver DNF module.

Enable the NVIDIA Driver module.

dnf module enable nvidia-driver:535-open

Code block. Enable NVIDIA Driver module

Updating Subscription Management repositories.
Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
Dependencies resolved.
=============================================
Package Architecture Version Repository Size
=============================================
Enabling module streams:
nvidia-driver 535-open

Transaction Summary
=============================================

Is this ok [y/N]: y

Code block. Example of the NVIDIA Driver module activation result

Check the list of available libnvidia-nscq packages.
dnf list libnvidia-nscq-535 --showduplicates
Code block. Check libnvidia-nscq package list
Install libnvidia-nscq.
dnf install libnvidia-nscq-535-535.183.06-1
Code block. libnvidia-nscq installation command

NVSwitch Device Monitoring API(NVSDM) Library Installation (for RHEL)

Reference

GPU architectures from Blackwell onward require the NVSDM Library. NVIDIA Driver versions 560 and below do not provide the NVSDM Library.
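
The driver-version rule in the note above can be checked before attempting the installation. The following is a minimal sketch, not part of the official procedure: the function name nvsdm_available is hypothetical, and it assumes the driver version string has the major.minor.patch form printed by nvidia-smi.

```shell
# Hypothetical helper: succeed (exit 0) when the driver major version is
# above 560, the range for which the note says the NVSDM Library is provided.
nvsdm_available() {
  major=${1%%.*}                      # keep the part before the first dot
  case "$major" in
    ''|*[!0-9]*) return 2 ;;          # not a numeric version string
  esac
  [ "$major" -gt 560 ]
}

# Example with the two driver versions that appear in this guide.
for v in 535.183.06 580.105.08; do
  if nvsdm_available "$v"; then
    echo "$v: NVSDM Library can be installed"
  else
    echo "$v: NVSDM Library is not provided"
  fi
done
```

On a GPU server, the version argument could be taken from the nvidia-smi driver version query shown earlier in this guide.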

Check the list of available NVSDM library packages.

dnf list libnvsdm --showduplicates

Code block. Check NVSDM library package list

libnvsdm:
  Installed: (none)
  Candidate: 580.105.08-1
  Version table:
     580.105.08-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.95.05-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.82.07-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     580.65.06-1 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. Example of NVSDM library package list verification results

Install libnvsdm.

dnf install libnvsdm-580.105.08-1

Code block. Install NVSDM library

Updating Subscription Management repositories.
Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
Dependencies resolved.
=========================================================================
Package Architecture Version Repository Size
=========================================================================
Installing:
libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
Installing dependencies:
infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k

Transaction Summary
=========================================================================
Install 3 Packages

Total download size: 1.0 M
Installed size: 3.2 M
Is this ok [y/N]: y

Code block. Example of NVSDM library installation command output

NVIDIA DCGM installation (for RHEL)

Install NVIDIA DCGM according to the steps below.

DCGM(datacenter-gpu-manager) Installation (for RHEL)

datacenter-gpu-manager-4-cuda12 is a specific version of NVIDIA’s Data Center GPU Manager (DCGM), a package for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates that the tool is built for CUDA 12, and datacenter-gpu-manager-4 refers to the 4.x version of DCGM. DCGM provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.

Add the CUDA Repository to DNF.

dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository

Check the CUDA version.
nvidia-smi | grep CUDA
Code block. Check CUDA version
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
Code block. Example of CUDA version check result
Set the CUDA_VERSION variable to the major version shown in the result.
CUDA_VERSION=12
Code block. CUDA version setting command
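
Instead of typing the value by hand, the major version can be extracted from the nvidia-smi output. The following is a minimal sketch, assuming the header line keeps the "CUDA Version: X.Y" form shown in the example result; the function name cuda_major is hypothetical.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the CUDA
# major version (for example 12 for "CUDA Version: 12.2").
cuda_major() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# On a GPU server (not run here): CUDA_VERSION=$(nvidia-smi | cuda_major)
# Demonstration with the sample line from this guide:
sample='| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |'
CUDA_VERSION=$(printf '%s\n' "$sample" | cuda_major)
echo "CUDA_VERSION=${CUDA_VERSION}"
```

The resulting variable can then be used in the package names in the following steps, exactly as with the manual assignment.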

Check the list of available datacenter-gpu-manager-cuda packages.

dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates

Code block. Check the datacenter-gpu-manager-cuda package list

Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Available Packages
datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64

Code block. Example of checking the datacenter-gpu-manager-cuda package list

Install datacenter-gpu-manager-cuda.

dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Code block. Install datacenter-gpu-manager-cuda

Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Dependencies resolved.
===================================================================================================
 Package                                       Architecture   Version     Repository          Size
===================================================================================================
Installing:
 datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
Installing dependencies:
 datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
Installing weak dependencies:
 datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
 datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M

Transaction Summary
====================================================================================================
Install  4 Packages
...
Is this ok [y/N]: y

Code block. Example of datacenter-gpu-manager-cuda installation result

datacenter-gpu-manager-exporter installation (for RHEL)

Add the CUDA Repository to DNF. (If you have already run this command, proceed to the next step.)

dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository

Check the CUDA version. (If you have already run this command, proceed to the next step.)
nvidia-smi | grep CUDA
Code block. Check CUDA version
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
Code block. Example of CUDA version check result
CUDA_VERSION=12
Code block. CUDA version setting command

Check the list of available datacenter-gpu-manager-exporter packages.

dnf list datacenter-gpu-manager-exporter --showduplicates

Code block. Check the datacenter-gpu-manager-exporter package list

Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Available Packages
datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64

Code block. Example of checking the datacenter-gpu-manager-exporter package list

Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.X requires glibc 2.34 or newer, which is provided starting with RHEL 9, so on RHEL 8 specify version 4.1.3-1 for installation.

dnf install datacenter-gpu-manager-exporter-4.1.3-1

Code block. datacenter-gpu-manager-exporter installation

Updating Subscription Management repositories.
Unable to read consumer identity

This system is not registered with an entitlement server. You can use subscription-manager to register.

Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
Dependencies resolved.
====================================================================================================
 Package                                       Architecture   Version     Repository          Size
====================================================================================================
Installing:
 datacenter-gpu-manager-exporter               x86_64         4.1.3-1     cuda-rhel8-x86_64   26 M


Is this ok [y/N]: y

Code block. Example of datacenter-gpu-manager-exporter installation result
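
The glibc requirement mentioned for the exporter version can be checked on the server before choosing which version to install. The following is a minimal sketch, assuming getconf and GNU sort (for the -V version sort) are available, as they normally are on RHEL; the function name version_at_least is hypothetical.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to $2.
# Relies on GNU sort -V, which orders version strings numerically.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# getconf prints a line such as "glibc 2.28"; keep only the version number.
glibc_version=$(getconf GNU_LIBC_VERSION | awk '{print $2}')

if version_at_least "$glibc_version" 2.34; then
  echo "glibc ${glibc_version}: dcgm-exporter 4.5.X or later can be considered"
else
  echo "glibc ${glibc_version}: pin datacenter-gpu-manager-exporter-4.1.3-1"
fi
```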

Check the counters file that the nvidia-dcgm-exporter service uses.

cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Code block. Check the datacenter-gpu-manager-exporter configuration file

ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

Code block. Example of checking the datacenter-gpu-manager-exporter configuration file

Review the settings provided with the DCGM Exporter installation: remove the leading # from the metrics you need, and add # to the metrics you do not need.

vi /etc/dcgm-exporter/default-counters.csv
## Example ##

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.

Code block. datacenter-gpu-manager-exporter metric configuration example

Reference

For the metrics that can be collected with the GPU DCGM Exporter and how to configure them, see DCGM Exporter Metrics.

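Caution

Metrics collected through the ServiceWatch Agent are classified as custom metrics and, unlike the default metrics, incur charges. Remove or disable any unnecessary metric collection to avoid excessive fees.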

Enable and start DCGM service (for RHEL)

Enable and start the nvidia-dcgm service.
systemctl enable --now nvidia-dcgm
Code block. nvidia-dcgm service enable and start command
Enable and start the nvidia-dcgm-exporter service.
systemctl enable --now nvidia-dcgm-exporter
Code block. nvidia-dcgm-exporter service enable and start command
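
After both services are running, one way to confirm that metrics are being exported is to query the exporter's Prometheus endpoint. The following is a minimal sketch, assuming the exporter listens on localhost:9400 (this guide does not state the port, so adjust it to your configuration) and that curl is installed; the function name count_dcgm_samples is hypothetical.

```shell
# Hypothetical helper: count DCGM sample lines in Prometheus text format read
# from stdin, ignoring the "# HELP" and "# TYPE" comment lines.
count_dcgm_samples() {
  grep -c '^DCGM_FI_' || true   # grep exits non-zero on zero matches
}

# On the GPU server (port 9400 is an assumption):
#   curl -s http://localhost:9400/metrics | count_dcgm_samples
# Demonstration with illustrative sample text (the values are made up):
sample='# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 35
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 36'
printf '%s\n' "$sample" | count_dcgm_samples
```

A count of zero on a real server would suggest that the exporter is not running or that every metric in the counters file is commented out.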

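Info

If you have completed the DCGM Exporter setup, you must install the OpenTelemetry Collector provided by ServiceWatch to finish configuring the ServiceWatch Agent.
For more details, refer to ServiceWatch > Using ServiceWatch Agent.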

DCGM Exporter metrics

DCGM Exporter Key Metrics

Among the metrics provided by the DCGM Exporter, the main GPU metrics are listed below.

Category	DCGM Field	Prometheus Metric Type	Summary
Clocks	DCGM_FI_DEV_SM_CLOCK	gauge	SM clock frequency (in MHz)
Clocks	DCGM_FI_DEV_MEM_CLOCK	gauge	Memory clock frequency (in MHz)
Temperature	DCGM_FI_DEV_GPU_TEMP	gauge	GPU temperature (in C)
Power	DCGM_FI_DEV_POWER_USAGE	gauge	Power draw (in W)
Utilization	DCGM_FI_DEV_GPU_UTIL	gauge	GPU utilization (in %)
Utilization	DCGM_FI_DEV_MEM_COPY_UTIL	gauge	Memory utilization (in %)
Memory Usage	DCGM_FI_DEV_FB_FREE	gauge	Frame buffer memory free (in MiB)
Memory Usage	DCGM_FI_DEV_FB_USED	gauge	Frame buffer memory used (in MiB)
NVLink	DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only)	counter	Total number of NVLink bandwidth counters for all lanes

Table. Main GPU metrics provided by the DCGM Exporter

DCGM Exporter metric collection configuration

For the default metrics configured for DCGM Exporter, refer to DCGM Exporter > Default Metrics.

To collect metrics beyond the defaults, remove the leading # from the corresponding entries in default-counters.csv.
For default metrics you do not wish to collect, add # or delete the corresponding item.

# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...

Code block. default-counters.csv configuration example
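
Enabling or disabling an entry amounts to removing or adding the leading #. The following is a minimal sketch using sed: the function names enable_metric and disable_metric are hypothetical, GNU sed (for the -i option) is assumed, and the demonstration edits a temporary file rather than /etc/dcgm-exporter/default-counters.csv.

```shell
# Hypothetical helpers: toggle the leading "#" of one DCGM field in a counters file.
# Usage: enable_metric FIELD FILE / disable_metric FIELD FILE
enable_metric() {
  sed -i "s/^#[[:space:]]*\($1,\)/\1/" "$2"
}
disable_metric() {
  sed -i "s/^\($1,\)/# \1/" "$2"
}

# Demonstration on a temporary file with two entries from the example above.
counters=$(mktemp)
cat > "$counters" <<'EOF'
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
EOF

enable_metric DCGM_FI_PROF_PCIE_TX_BYTES "$counters"
disable_metric DCGM_FI_DEV_GPU_TEMP "$counters"
cat "$counters"
```

When editing the real file, keep a backup copy first. Restarting the nvidia-dcgm-exporter service afterwards is likely needed for the change to be picked up, although this guide does not describe the reload behavior.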

3 - API Reference

API Reference

4 - CLI Reference

CLI Reference

5 - Release Note

GPU Server

2026.07.16

FEATURE Add new server type and other feature changes

We provide NPU server types to diversify AI accelerators.
- You can select and use an OS image for the NPU.
The n1 server type and the A100 HPE server type have been added.
- For detailed specifications, refer to Server type.
Even if the Backup service is connected, you can delete the server.
- When a server is deleted, the schedule of the associated backup is removed, but the backup itself remains intact and can be restored and utilized.

2026.03.19

FEATURE Add Kubernetes image and SSD_Provisioned disk type

Add Kubernetes image
- When creating a GPU server, you can select the Kubernetes image (Ubuntu).
An SSD volume with configurable IOPS and Throughput has been added.
- When creating Block Storage, you can select the SSD_Provisioned disk type.
- You can set the maximum IOPS and Throughput.

2025.10.23

FEATURE Add new features and provide ServiceWatch service integration functionality

ServiceWatch service integration is provided.
- You can monitor data through the ServiceWatch service.
When creating a GPU Server, you can select a RHEL image.
Keypair management feature has been added.
- You can create a keypair for use or retrieve a public key and apply it.

2025.07.01

FEATURE Add GPU Server feature, change Image sharing method, and add GPU Server usage guide

Add GPU Server feature
- IP, Public NAT IP, Private NAT IP configuration feature has been added.
- An LLM Endpoint is provided for LLM usage.
The method for sharing images between accounts has changed.
- You can create a new Image for sharing.
Add GPU Server usage guide
- The guides Use Multi-instance GPU on GPU Server and Use NVSwitch on GPU Server have been added.

2025.04.28

FEATURE Add OS image

GPU Server RHEL OS and GPU driver versions have been added.

2025.02.27

FEATURE Common feature change

Add GPU Server feature
- NAT configuration functionality has been added to the GPU Server.
Samsung Cloud Platform Common Feature Change
- Account, IAM, Service Home, tags, and other features have been updated to reflect common CX changes.

2024.10.01

NEW GPU Server Service Official Version Release

We have officially launched the GPU Server service.
We have launched a virtualization computing service that lets you allocate and use infrastructure resources such as CPU, GPU, and memory provided by the server as needed, without having to purchase them individually.
