ServiceWatch Agent Installation
Users can install the ServiceWatch Agent on a GPU Server to collect custom metrics and logs.
ServiceWatch Agent
Two main types of agents must be installed to collect ServiceWatch custom metrics and logs on a GPU Server: Prometheus Exporter and Open Telemetry Collector.
| Category | Detailed Description |
|---|---|
| Prometheus Exporter | Provides metrics of specific applications or services in a format that Prometheus can scrape |
| Open Telemetry Collector | A centralized collector that collects telemetry data such as metrics and logs from distributed systems, processes (filtering, sampling, etc.), and sends them to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch, etc.) |
If Kubernetes Engine is configured on GPU Server, please check GPU metrics through the metrics provided by Kubernetes Engine.
- If you install DCGM Exporter on a GPU Server with Kubernetes Engine configured, it may not work properly.
Prerequisites for Using ServiceWatch Agent
Step 1. Check NVIDIA Driver installation
Install Prometheus Exporter for GPU Metrics (for Ubuntu)
Install the Prometheus Exporter to collect GPU Server metrics in the following order.
Check NVIDIA Driver Installation
- Check the installed NVIDIA Driver.

    nvidia-smi --query-gpu=driver_version --format=csv

Code block. Check NVIDIA Driver version command

    driver_version
    535.183.06
    ...
    535.183.06

Code block. Check NVIDIA Driver version example
Install NVSwitch Configuration and Query (NSCQ) Library
Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb

Code block. NSCQ library download command

    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    apt update

Code block. NSCQ library installation command

    nvidia-smi --query-gpu=driver_version --format=csv

Code block. Check NVIDIA Driver version command

    driver_version
    535.183.06
    ...
    535.183.06

Code block. Check NVIDIA Driver version example
Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535

Code block. NSCQ library apt-cache command

    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install the version that matches the Driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
         ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. NSCQ library apt-cache command result

    apt install libnvidia-nscq-535=535.183.06-1

Code block. NSCQ library installation command
Notice: You must install the same version as the NVIDIA Driver version.
- Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
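The version-matching rule above can be sketched as a small shell helper. This is only an illustration, not part of the official procedure: the function name `nscq_pkg_spec` is hypothetical, and the `-1` package revision suffix is assumed from the example above, so confirm the exact version string in the `apt-cache policy` output before installing.

```shell
#!/bin/sh
# Hypothetical helper: build the apt package spec for libnvidia-nscq
# from a driver version string such as 535.183.06.
# Assumption: the package revision suffix is "-1", as in the example above.
nscq_pkg_spec() {
    driver_version="$1"
    branch="${driver_version%%.*}"   # driver branch, e.g. 535
    printf 'libnvidia-nscq-%s=%s-1\n' "$branch" "$driver_version"
}

# Example with the driver version used in this guide.
nscq_pkg_spec 535.183.06
```

On a GPU Server, the argument could be taken from `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1`, and the printed spec passed to `apt install`.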
Install NVSwitch Device Monitoring API (NVSDM) Library
- Install NVSDM library.

    apt-cache policy libnvsdm

Code block. NVSDM library apt-cache command

    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. NVSDM library apt-cache command result

    apt install libnvsdm=580.105.08-1

Code block. NVSDM library installation
Install NVIDIA DCGM (for Ubuntu)
Install DCGM Exporter in the following order.
- Install datacenter-gpu-manager-exporter
- Enable and start DCGM service
- For metrics that can be collected with GPU DCGM Exporter and configuration methods, refer to DCGM Exporter Metrics.
Install DCGM(datacenter-gpu-manager)
The package datacenter-gpu-manager-4-cuda12 is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates that the build targets CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides GPU status monitoring, diagnostics, alerting, and power/clock management.
- Check CUDA version.

    nvidia-smi | grep CUDA

Code block. Check CUDA version

    | NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |

Code block. Check CUDA version result example

    CUDA_VERSION=12

Code block. Set CUDA version command
- Install datacenter-gpu-manager-cuda.

    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Code block. datacenter-gpu-manager-cuda installation command
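Instead of reading the CUDA major version off the `nvidia-smi` banner by hand, it can be extracted with a short filter. The sketch below is an assumption-laden illustration: `cuda_major` is a hypothetical function name, and it relies on the banner containing the text `CUDA Version: <major>.<minor>` as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the
# CUDA major version (the part before the first dot).
cuda_major() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Demonstration with the banner line shown in this guide.
sample='| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |'
CUDA_VERSION=$(printf '%s\n' "$sample" | cuda_major)
echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```

On a GPU Server, `CUDA_VERSION=$(nvidia-smi | cuda_major)` would set the variable used by the install command.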
Install datacenter-gpu-manager-exporter
This is a tool that collects various GPU metrics such as GPU usage, memory usage, temperature, and power consumption based on NVIDIA Data Center GPU Manager (DCGM) and exposes them for use in monitoring systems like Prometheus.
- Install datacenter-gpu-manager-exporter.

    apt install datacenter-gpu-manager-exporter

Code block. datacenter-gpu-manager-exporter installation command
- Check DCGM Exporter configuration file.

    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Code block. Check datacenter-gpu-manager-exporter configuration file command

    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

Code block. Check datacenter-gpu-manager-exporter configuration file result example
- Check the configuration provided at DCGM Exporter installation: remove the leading # from metrics you need, and add # in front of metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv

    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
    ...

Code block. datacenter-gpu-manager-exporter metric configuration example
Caution: Metrics collected through ServiceWatch Agent are classified as custom metrics and, unlike the basic metrics collected from each service, are billed. Remove or disable unnecessary metrics to avoid excessive charges.
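Adding or removing the leading `#` can also be scripted rather than done in an editor. The following is a minimal sketch under stated assumptions: `disable_metric` and `enable_metric` are hypothetical helper names, `sed -i` is the GNU form, and the demonstration runs on a temporary file rather than on /etc/dcgm-exporter/default-counters.csv.

```shell
#!/bin/sh
# Hypothetical helpers for toggling one metric in a dcgm-exporter counters file.
# Usage: disable_metric FILE FIELD    /    enable_metric FILE FIELD

disable_metric() {
    # Prefix the line with "# " only if it currently starts with the field name.
    sed -i "s/^$2,/# &/" "$1"
}

enable_metric() {
    # Strip a leading "#" (and any spaces after it) in front of the field name.
    sed -i "s/^#[[:space:]]*\($2,\)/\1/" "$1"
}

# Demonstration on a temporary file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
EOF
disable_metric "$demo" DCGM_FI_PROF_DRAM_ACTIVE
enable_metric "$demo" DCGM_FI_PROF_PIPE_FP64_ACTIVE
cat "$demo"
```

If the exporter is already running, it would likely need a restart (for example `systemctl restart nvidia-dcgm-exporter`) before a changed counters file takes effect.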
Enable and Start DCGM Service
Enable and start nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm

Code block. Enable and start nvidia-dcgm service command
Enable and start nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter

Code block. Enable and start nvidia-dcgm-exporter service command
For details, refer to ServiceWatch > Use ServiceWatch Agent.
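After both services are started, a quick sanity check is to confirm that the exporter actually serves DCGM samples. The sketch below counts sample lines in Prometheus text format. `count_dcgm_samples` is a hypothetical helper name, and the input shown is an abbreviated, made-up sample for illustration only (real exporter output carries more labels).

```shell
#!/bin/sh
# Hypothetical helper: count DCGM sample lines in Prometheus text format
# read from stdin. "# HELP" and "# TYPE" lines are not counted.
count_dcgm_samples() {
    grep -c '^DCGM_FI_' || true
}

# Abbreviated, illustrative input (not real exporter output).
sample='# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 43'
printf '%s\n' "$sample" | count_dcgm_samples
```

Assuming the exporter listens on its default port 9400 and the service file does not override it, the live check would be `curl -s http://localhost:9400/metrics | count_dcgm_samples`. A result of 0 suggests that the exporter is not collecting metrics.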
Install Prometheus Exporter for GPU Metrics (for RHEL)
Install ServiceWatch Agent to collect GPU Server metrics in the following order.
Check NVIDIA Driver Installation (for RHEL)
- Check the installed NVIDIA Driver.

    nvidia-smi --query-gpu=driver_version --format=csv

Code block. Check NVIDIA Driver version command

    driver_version
    535.183.06
    ...
    535.183.06

Code block. Check NVIDIA Driver version example
Install NVSwitch Configuration and Query (NSCQ) Library (for RHEL)
The NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier generation GPUs.
- For RHEL, check whether libnvidia-nscq is already installed, and install it if it is not.
Check libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq
    libnvidia-nscq-535-535.183.06-1.x86_64

Code block. Check NSCQ library package
Add CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository
Reset NVIDIA Driver status.

    dnf module reset nvidia-driver

Code block. Reset NVIDIA Driver DNF module status

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
     Package   Architecture   Version   Repository   Size
    =============================================
    Disabling module profiles:
     nvidia-driver/default
     nvidia-driver/fm
    Resetting modules:
     nvidia-driver

    Transaction Summary
    =============================================

    Is this ok [y/N]: y

Code block. Reset NVIDIA Driver DNF module status result example
Enable NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open

Code block. Enable NVIDIA Driver module

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
     Package   Architecture   Version   Repository   Size
    =============================================
    Enabling module streams:
     nvidia-driver   535-open

    Transaction Summary
    =============================================

    Is this ok [y/N]: y

Code block. Enable NVIDIA Driver module result example
Check libnvidia-nscq module list.

    dnf list libnvidia-nscq-535 --showduplicates

Code block. Check libnvidia-nscq module list
Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1

Code block. libnvidia-nscq installation command
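Whether an installed NSCQ package matches the driver can be checked by comparing the `rpm -qa` package name with the driver version. This is a minimal sketch: `nscq_matches_driver` is a hypothetical helper, and it assumes the package name follows the `libnvidia-nscq-<branch>-<version>-<release>.<arch>` pattern seen in the example above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an rpm package name carries the given driver version.
# $1: package name as printed by "rpm -qa", $2: driver version from nvidia-smi.
nscq_matches_driver() {
    case "$1" in
        libnvidia-nscq-*-"$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Demonstration with the values used in this guide.
if nscq_matches_driver "libnvidia-nscq-535-535.183.06-1.x86_64" "535.183.06"; then
    echo "match"
else
    echo "mismatch"
fi
```

A mismatch indicates that the package should be reinstalled with the version string that corresponds to the driver.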
Install NVSwitch Device Monitoring API(NVSDM) Library (for RHEL)
Check NVSDM library module list.

    dnf list libnvsdm --showduplicates

Code block. Check NVSDM library module list

    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages

Code block. Check NVSDM library module list result example
Install libnvsdm.

    dnf install libnvsdm-580.105.08-1

Code block. NVSDM library installation

    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    ========================================================================
     Package            Architecture   Version        Repository                      Size
    ========================================================================
    Installing:
     libnvsdm           x86_64         580.105.08-1   cuda-rhel8-x86_64               675 k
    Installing dependencies:
     infiniband-diags   x86_64         48.0-1.el8     rhel-8-for-x86_64-baseos-rpms   323 k
     libibumad          x86_64         48.0-1.el8     rhel-8-for-x86_64-baseos-rpms    34 k

    Transaction Summary
    ========================================================================
    Install  3 Packages

    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y

Code block. NVSDM library installation command result example
Install NVIDIA DCGM (for RHEL)
Install DCGM Exporter in the following order.
- Install datacenter-gpu-manager-exporter
- Enable and start DCGM service
- For metrics that can be collected with GPU DCGM Exporter and configuration methods, refer to DCGM Exporter Metrics.
Install DCGM(datacenter-gpu-manager) (for RHEL)
The package datacenter-gpu-manager-4-cuda12 is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates that the build targets CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides GPU status monitoring, diagnostics, alerting, and power/clock management.
- Add CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository
- Check CUDA version.

    nvidia-smi | grep CUDA

Code block. Check CUDA version

    | NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |

Code block. Check CUDA version result example

    CUDA_VERSION=12

Code block. Set CUDA version command
- Check datacenter-gpu-manager-cuda module list.

    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates

Code block. Check datacenter-gpu-manager-cuda module list

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1   cuda-rhel8-x86_64

Code block. Check datacenter-gpu-manager-cuda module list result example
- Install datacenter-gpu-manager-cuda.

    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}

Code block. datacenter-gpu-manager-cuda installation

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M

    Transaction Summary
    ===================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y

Code block. datacenter-gpu-manager-cuda installation result example
Install datacenter-gpu-manager-exporter (for RHEL)
This is a tool that collects various GPU metrics such as GPU usage, memory usage, temperature, and power consumption based on NVIDIA Data Center GPU Manager (DCGM) and exposes them for use in monitoring systems like Prometheus.
Add CUDA Repository to DNF. (If you have already run this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Code block. Add DNF Repository
Check CUDA version. (If you have already run this command, proceed to the next step.)

    nvidia-smi | grep CUDA

Code block. Check CUDA version

    | NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |

Code block. Check CUDA version result example

    CUDA_VERSION=12

Code block. Set CUDA version command
Check datacenter-gpu-manager-exporter module list.

    dnf list datacenter-gpu-manager-exporter --showduplicates

Code block. Check datacenter-gpu-manager-exporter module list

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64

Code block. Check datacenter-gpu-manager-exporter module list result example
Install datacenter-gpu-manager-exporter.
dcgm-exporter 4.5.X requires glibc 2.34 or higher, but RHEL 8 provides glibc 2.28, so specify version 4.1.3-1 to install.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1

Code block. datacenter-gpu-manager-exporter installation

    Updating Subscription Management repositories.
    Unable to read consumer identity

    This system is not registered with an entitlement server. You can use subscription-manager to register.

    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
     Package                           Architecture   Version   Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter   x86_64         4.1.3-1   cuda-rhel8-x86_64   26 M
    ...
    Is this ok [y/N]: y

Code block. datacenter-gpu-manager-exporter installation result example
Check DCGM Exporter configuration file.

    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart

Code block. datacenter-gpu-manager-exporter configuration file

    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv

Code block. Check datacenter-gpu-manager-exporter configuration file result example
Check the configuration provided at DCGM Exporter installation: remove the leading # from metrics you need, and add # in front of metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv

    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
    ...

Code block. datacenter-gpu-manager-exporter metric configuration example
Reference: For metrics that can be collected with the GPU DCGM Exporter and configuration methods, refer to DCGM Exporter Metrics.
Caution: Metrics collected through ServiceWatch Agent are classified as custom metrics and, unlike the basic metrics collected from each service, are billed. Remove or disable unnecessary metrics to avoid excessive charges.
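The glibc requirement mentioned above (2.34 or higher for dcgm-exporter 4.5.X) can be checked on the host before choosing an exporter version. The following is a sketch under assumptions: `version_ge` is a hypothetical helper that relies on `sort -V` from GNU coreutils, and the glibc version is read with `getconf GNU_LIBC_VERSION`, which is available on glibc-based systems.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to version $2.
# Relies on "sort -V" (GNU coreutils) for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# getconf GNU_LIBC_VERSION prints a line such as "glibc 2.28" on glibc-based systems.
glibc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if version_ge "${glibc_version:-0}" 2.34; then
    echo "glibc ${glibc_version}: dcgm-exporter 4.5.X requirement is met"
else
    echo "glibc ${glibc_version:-unknown}: install an older exporter such as 4.1.3-1"
fi
```

The threshold 2.34 and the fallback version 4.1.3-1 are taken from the note above. Other exporter releases may have different requirements.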
Enable and Start DCGM Service (for RHEL)
Enable and start nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm

Code block. Enable and start nvidia-dcgm service command
Enable and start nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter

Code block. Enable and start nvidia-dcgm-exporter service command
For details, refer to ServiceWatch > Use ServiceWatch Agent.
DCGM Exporter Metrics
DCGM Exporter Key Metrics
The key GPU metrics provided by DCGM Exporter are as follows.
| Category | DCGM Field | Prometheus Metric Type | Summary |
|---|---|---|---|
| Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz) |
| Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz) |
| Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C) |
| Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W) |
| Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %) |
| Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %) |
| Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB) |
| Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB) |
| NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes |
DCGM Exporter Metrics Collection Configuration
For the metrics configured by default in DCGM Exporter, refer to DCGM Exporter > Default Metrics.
- To collect additional metrics beyond the default settings, remove the leading # from the corresponding line in default-counters.csv.
- To stop collecting a metric that is enabled by default, add # at the start of its line or delete the line.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES, counter, Total number of bytes received through PCIe RX via NVML.
...
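To see which metrics a counters file will actually export, the format described above (comment lines start with `#`, and active lines are `DCGM FIELD, Prometheus metric type, help message`) can be parsed with a short script. This is a minimal sketch: `list_active_metrics` is a hypothetical helper, and the demonstration uses a temporary file built from lines in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print "FIELD TYPE" for every active (uncommented) line
# of a counters file in the "DCGM FIELD, Prometheus metric type, help message" format.
list_active_metrics() {
    awk -F',' '
        /^[[:space:]]*#/ { next }   # comment or disabled metric
        NF < 3           { next }   # blank or incomplete line
        {
            field = $1; kind = $2
            gsub(/^[ \t]+|[ \t]+$/, "", field)
            gsub(/^[ \t]+|[ \t]+$/, "", kind)
            print field, kind
        }
    ' "$1"
}

# Demonstration on a temporary file with lines from the excerpt above.
demo=$(mktemp)
cat > "$demo" <<'EOF'
# Clocks
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
# Power
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# DCGM_FI_PROF_PCIE_TX_BYTES, counter, Total number of bytes transmitted through PCIe TX via NVML.
EOF
list_active_metrics "$demo"
```

Running it against /etc/dcgm-exporter/default-counters.csv before enabling the service gives a quick view of which custom metrics would be collected and billed.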