Install ServiceWatch Agent
Users can install the ServiceWatch Agent on a GPU server to collect custom metrics and logs.
ServiceWatch Agent
Two main types of agents must be installed on a GPU server to collect custom metrics and logs for ServiceWatch: a Prometheus Exporter and the OpenTelemetry Collector.
| Category | Detailed description |
|---|---|
| Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape |
| OpenTelemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and exports it to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch) |
If you have configured a Kubernetes Engine on a GPU server, please check GPU metrics using the metrics provided by the Kubernetes Engine.
- If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not operate correctly.
Pre-configuration for Using ServiceWatch Agent
To utilize the ServiceWatch Agent, please refer to Prerequisite Settings for ServiceWatch Agent and prepare the prerequisite settings.
Install Prometheus Exporter for GPU metrics (for Ubuntu)
Install the Prometheus Exporter for collecting metrics from the GPU server according to the steps below.
Verify NVIDIA Driver Installation
- Check the installed NVIDIA driver version.

    nvidia-smi --query-gpu driver_version --format csv

Example output:

    driver_version
    535.183.06
    ...
    535.183.06
NVSwitch Configuration and Query (NSCQ) Library Installation
- Install cuda-keyring.

Download the cuda-keyring package:

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb

Install the package and refresh the package index:

    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    apt update

Check the NVIDIA driver version:

    nvidia-smi --query-gpu driver_version --format csv

Example output:

    driver_version
    535.183.06
    ...
    535.183.06

- Install libnvidia-nscq.

List the available versions:

    apt-cache policy libnvidia-nscq-535

Example output:

    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install the version that matches the driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Install the version that matches the driver:

    apt install libnvidia-nscq-535=535.183.06-1

- The library must be installed at the same version as the NVIDIA driver.
- Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
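The version-matching rule above can be scripted. The following is a minimal sketch, not part of the official procedure: the helper name nscq_pkg_spec is hypothetical, and the "-1" package revision is an assumption taken from the example, so confirm it against the apt-cache policy output before installing.

```shell
# Hypothetical helper: build the apt package spec that matches a driver version.
# Assumption: the package revision suffix is "-1", as in the example above.
nscq_pkg_spec() {
  driver_version=$1
  branch=${driver_version%%.*}   # driver branch, e.g. 535
  printf 'libnvidia-nscq-%s=%s-1\n' "$branch" "$driver_version"
}

# Usage on a GPU server (not executed here):
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   apt install "$(nscq_pkg_spec "$drv")"
```

Deriving the package spec from the driver version keeps the two from drifting apart when the driver is upgraded.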
NVSwitch Device Monitoring API (NVSDM) Library Installation
- Install the NVSDM library.

List the available versions:

    apt-cache policy libnvsdm

Example output:

    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Install the library:

    apt install libnvsdm=580.105.08-1
Install NVIDIA DCGM (for Ubuntu)
Install the DCGM Exporter according to the steps below.
- DCGM(datacenter-gpu-manager) Installation
- datacenter-gpu-manager-exporter installation
- DCGM Service Activation and Start
DCGM(datacenter-gpu-manager) Installation
The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA's Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.
- Check the CUDA version.

    nvidia-smi | grep CUDA

Example output:

    | NVIDIA-SMI 535.183.06    Driver Version: 535.183.06    CUDA Version: 12.2 |

Set the CUDA major version:

    CUDA_VERSION=12

- Install datacenter-gpu-manager-cuda.

    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
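Instead of setting CUDA_VERSION by hand, the major version can be extracted from the nvidia-smi header. This is a minimal sketch; the helper name cuda_major_version is hypothetical, and it assumes the header contains the text "CUDA Version: X.Y" as in the example output above.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the CUDA major version.
cuda_major_version() {
  # Capture the digits between "CUDA Version:" and the first dot.
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Usage on a GPU server (not executed here):
#   CUDA_VERSION=$(nvidia-smi | cuda_major_version)
#   apt install "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```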
Install datacenter-gpu-manager-exporter
datacenter-gpu-manager-exporter is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.
- Install datacenter-gpu-manager-exporter.

    apt install datacenter-gpu-manager-exporter

- Check the DCGM Exporter configuration file.

    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Example output:

    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

- Review the configuration installed with the DCGM Exporter: remove the leading # from the metrics you need, and add a leading # to the metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv

    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
    ...
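Before restarting the exporter, it can help to see which metrics are currently enabled in the counters file. The following is a minimal sketch; the helper name list_enabled_counters is hypothetical, and it assumes the "FIELD, type, help" line format shown in the example above.

```shell
# Hypothetical helper: print the DCGM field names that are enabled
# (not commented out) in a counters file.
list_enabled_counters() {
  awk -F',' '
    /^[ \t]*#/ { next }                      # skip commented-out entries
    NF >= 3 {                                # keep only "FIELD, type, help" lines
      gsub(/^[ \t]+|[ \t]+$/, "", $1)        # trim whitespace around the field name
      print $1
    }
  ' "$1"
}

# Usage (not executed here):
#   list_enabled_counters /etc/dcgm-exporter/default-counters.csv
```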
Enable and start DCGM service
- Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm

- Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
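Once both services are running, the exporter output can be spot-checked from the server itself. This is a minimal sketch under stated assumptions: the listen address http://localhost:9400/metrics is assumed to be the dcgm-exporter default and is not given in this guide, and the helper name metric_names is hypothetical.

```shell
# Hypothetical helper: read Prometheus text-format output on stdin
# and print the distinct metric names it contains.
metric_names() {
  awk '
    /^#/ || NF == 0 { next }          # skip HELP/TYPE comments and blank lines
    { sub(/[{ ].*$/, ""); print }     # keep the text before the first "{" or space
  ' | sort -u
}

# Usage on the GPU server (not executed here; port 9400 is an assumption):
#   curl -s http://localhost:9400/metrics | metric_names
```

If the list is empty or a metric you enabled is missing, recheck the counters file and the service status.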
For more details, refer to ServiceWatch > Using ServiceWatch Agent.
Install Prometheus Exporter for GPU metrics (for RHEL)
Install the ServiceWatch Agent to collect metrics from the GPU server according to the steps below.
NVIDIA Driver installation verification (for RHEL)
- Check the installed NVIDIA driver version.

    nvidia-smi --query-gpu driver_version --format csv

Example output:

    driver_version
    535.183.06
    ...
    535.183.06
NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)
The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.
- For RHEL, verify that libnvidia-nscq is installed and install it if necessary.
- Check the libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq

Example output:

    libnvidia-nscq-535-535.183.06-1.x86_64

- Add the CUDA repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

- Reset the state of the NVIDIA driver DNF module.

    dnf module reset nvidia-driver

Example output:

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
     Package    Architecture    Version    Repository    Size
    =============================================
    Disabling module profiles:
     nvidia-driver/default
     nvidia-driver/fm
    Resetting modules:
     nvidia-driver

    Transaction Summary
    =============================================

    Is this ok [y/N]: y

- Enable the NVIDIA driver module.

    dnf module enable nvidia-driver:535-open

Example output:

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
     Package    Architecture    Version    Repository    Size
    =============================================
    Enabling module streams:
     nvidia-driver    535-open

    Transaction Summary
    =============================================

    Is this ok [y/N]: y

- Check the list of available libnvidia-nscq packages.

    dnf list libnvidia-nscq-535 --showduplicates

- Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
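After installation, the installed NSCQ package can be compared with the driver version. The following is a minimal sketch; the helper name nscq_rpm_matches_driver is hypothetical, and it assumes the libnvidia-nscq-<branch>-<version>-<release>.<arch> naming shown in the rpm -qa example above.

```shell
# Hypothetical helper: succeed if a package name printed by "rpm -qa"
# carries the given driver version.
nscq_rpm_matches_driver() {
  case "$1" in
    libnvidia-nscq-*-"$2"-*) return 0 ;;   # e.g. libnvidia-nscq-535-535.183.06-1.x86_64
    *) return 1 ;;
  esac
}

# Usage on a GPU server (not executed here):
#   pkg=$(rpm -qa | grep libnvidia-nscq)
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   nscq_rpm_matches_driver "$pkg" "$drv" || echo "NSCQ and driver versions differ"
```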
NVSwitch Device Monitoring API (NVSDM) Library Installation (for RHEL)
- Check the list of available NVSDM library packages.

    dnf list libnvsdm --showduplicates

- Install libnvsdm.

    dnf install libnvsdm-580.105.08-1

Example output:

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    =========================================================================
     Package            Architecture   Version        Repository                      Size
    =========================================================================
    Installing:
     libnvsdm           x86_64         580.105.08-1   cuda-rhel8-x86_64               675 k
    Installing dependencies:
     infiniband-diags   x86_64         48.0-1.el8     rhel-8-for-x86_64-baseos-rpms   323 k
     libibumad          x86_64         48.0-1.el8     rhel-8-for-x86_64-baseos-rpms    34 k

    Transaction Summary
    =========================================================================
    Install  3 Packages

    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
Install NVIDIA DCGM (for RHEL)
Install the DCGM Exporter according to the steps below.
- DCGM(datacenter-gpu-manager) Installation
- datacenter-gpu-manager-exporter installation
- DCGM Service Activation and Start
DCGM(datacenter-gpu-manager) Installation (for RHEL)
The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA's Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides features including GPU status monitoring, diagnostics, alerting, and power/clock management.
- Add the CUDA repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

- Check the CUDA version.

    nvidia-smi | grep CUDA

Example output:

    | NVIDIA-SMI 535.183.06    Driver Version: 535.183.06    CUDA Version: 12.2 |

Set the CUDA major version:

    CUDA_VERSION=12

- Check the list of available datacenter-gpu-manager-cuda packages.

    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates

Example output:

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64    1:4.4.2-1    cuda-rhel8-x86_64

- Install datacenter-gpu-manager-cuda.

    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Example output:

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M

    Transaction Summary
    ===================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
datacenter-gpu-manager-exporter installation (for RHEL)
datacenter-gpu-manager-exporter is a tool built on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them to monitoring systems such as Prometheus.
- Add the CUDA repository to DNF. (If you have already performed this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

- Check the CUDA version. (If you have already performed this command, proceed to the next step.)

    nvidia-smi | grep CUDA

Example output:

    | NVIDIA-SMI 535.183.06    Driver Version: 535.183.06    CUDA Version: 12.2 |

Set the CUDA major version:

    CUDA_VERSION=12

- Check the list of available datacenter-gpu-manager-exporter packages.

    dnf list datacenter-gpu-manager-exporter --showduplicates

Example output:

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64    4.0.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.1.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.5.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.5.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.5.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64    4.6.0-1    cuda-rhel8-x86_64

- Install datacenter-gpu-manager-exporter.
dcgm-exporter 4.5.x requires glibc 2.34 or newer. RHEL 8 ships glibc 2.28 (glibc 2.34 is provided from RHEL 9), so on RHEL 8 install the exporter by specifying version 4.1.3-1.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1

Example output:

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
     Package                           Architecture   Version   Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter   x86_64         4.1.3-1   cuda-rhel8-x86_64   26 M

    Is this ok [y/N]: y

- Check the DCGM Exporter configuration file.

    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Example output:

    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

- Review the configuration installed with the DCGM Exporter: remove the leading # from the metrics you need, and add a leading # to the metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv

    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
    ...
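The glibc requirement mentioned in this section (2.34 or newer for dcgm-exporter 4.5.x) can be checked on the server before choosing an exporter version. This is a minimal sketch; the helper name version_ge is hypothetical, and it relies on the -V (version sort) option of GNU sort, which is assumed to be available as it is on typical RHEL and Ubuntu installations.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to version $2.
version_ge() {
  # After a version sort, the smaller of the two values comes first;
  # $1 >= $2 holds exactly when that smallest value equals $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on the server (not executed here):
#   glibc=$(ldd --version | head -n 1 | awk '{print $NF}')
#   if version_ge "$glibc" 2.34; then
#     echo "glibc $glibc: exporter 4.5.x and later can be considered"
#   else
#     echo "glibc $glibc: pin datacenter-gpu-manager-exporter-4.1.3-1"
#   fi
```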
Enable and start DCGM service (for RHEL)
- Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm

- Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
For more details, refer to ServiceWatch > Using ServiceWatch Agent.
DCGM Exporter metric
DCGM Exporter Key Metrics
Among the metrics provided by the DCGM Exporter, the main GPU metrics are as follows.
| Category | DCGM Field | Prometheus Metric Type | Summary |
|---|---|---|---|
| Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz) |
| Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C) |
| Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W) |
| Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %) |
| Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %) |
| Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB) |
| Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB) |
| NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes |
DCGM Exporter metric collection configuration
For the default metrics of the DCGM Exporter, refer to DCGM Exporter > Default Metrics.
- To collect metrics beyond the default configuration, remove the leading # from the corresponding entries in default-counters.csv.
- For default metrics you do not want to collect, add a leading # or delete the corresponding entry.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES, counter, Total number of bytes received through PCIe RX via NVML.
...
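The comment toggling described above can also be done without an interactive editor. The following is a minimal sketch; the helper names enable_counter and disable_counter are hypothetical, and it assumes field names consist only of letters, digits, and underscores so that they are safe to embed in a sed pattern.

```shell
# Hypothetical helper: remove the leading "#" from one DCGM field entry.
# $1 = counters file, $2 = DCGM field name
enable_counter() {
  tmp=$(mktemp) || return 1
  sed "s/^#[[:space:]]*\($2,\)/\1/" "$1" > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Hypothetical helper: comment out one DCGM field entry that is currently enabled.
disable_counter() {
  tmp=$(mktemp) || return 1
  sed "s/^\($2,\)/# \1/" "$1" > "$tmp" && cat "$tmp" > "$1"
  rm -f "$tmp"
}

# Usage (not executed here):
#   enable_counter  /etc/dcgm-exporter/default-counters.csv DCGM_FI_PROF_PCIE_TX_BYTES
#   disable_counter /etc/dcgm-exporter/default-counters.csv DCGM_FI_DEV_MEMORY_TEMP
#   systemctl restart nvidia-dcgm-exporter
```

Writing through a temporary file keeps the original file's ownership and permissions, and the restart in the usage comment is there because the exporter is assumed to read the counters file only at startup.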