The page has been translated by Gen AI.

ServiceWatch Agent Installation

Users can install ServiceWatch Agent on GPU Server to collect custom metrics and logs.

Reference
Custom metrics/log collection through ServiceWatch Agent is currently available only in Samsung Cloud Platform For Enterprise. It will be available in other offerings in the future.
Caution
Metric collection through ServiceWatch Agent is classified as custom metrics, and unlike basic metrics collected from each service, fees are charged. Therefore, it is recommended to remove or disable unnecessary metric collection settings.

ServiceWatch Agent

Two main types of Agents must be installed to collect ServiceWatch custom metrics and logs on GPU Server: Prometheus Exporter and Open Telemetry Collector.

Category | Detailed Description
Prometheus Exporter | Provides metrics of specific applications or services in a format that Prometheus can scrape
  • For collecting server OS metrics, you can use Node Exporter for Linux servers and Windows Exporter for Windows servers depending on the OS type.
Open Telemetry Collector | A centralized collector that collects telemetry data such as metrics and logs from distributed systems, processes (filtering, sampling, etc.), and sends them to multiple backends (e.g., Prometheus, Jaeger, Elasticsearch, etc.)
  • Sends data to ServiceWatch Gateway to enable ServiceWatch to collect metric and log data.
Table. Description of Prometheus Exporter and Open Telemetry Collector
Caution

If Kubernetes Engine is configured on GPU Server, please check GPU metrics through the metrics provided by Kubernetes Engine.

  • If you install DCGM Exporter on a GPU Server with Kubernetes Engine configured, it may not work properly.

Prerequisites for Using ServiceWatch Agent

Step 1. Check NVIDIA Driver installation

Install Prometheus Exporter for GPU Metrics (for Ubuntu)

Install the Prometheus Exporter to collect GPU Server metrics in the following order.

Check NVIDIA Driver Installation

  • Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. Check NVIDIA Driver version command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Check NVIDIA Driver version example
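The command prints one version line per GPU, and the later library steps assume a single driver version for the whole server. The following is a minimal sketch of a check for that, assuming the per-GPU versions are supplied one per line; `single_driver_version` is a hypothetical helper name used only for illustration.

```shell
#!/usr/bin/env bash
# single_driver_version (hypothetical helper): reads one driver version per
# line on stdin and prints it only when every line reports the same version.
single_driver_version() {
  local versions count
  versions="$(sort -u)"
  count="$(printf '%s\n' "${versions}" | grep -c .)"
  if [ "${count}" -ne 1 ]; then
    echo "expected exactly one driver version, found ${count}" >&2
    return 1
  fi
  printf '%s\n' "${versions}"
}

# Example run against the sample output shown above (header line omitted).
printf '535.183.06\n535.183.06\n' | single_driver_version

# On a GPU Server the input could come from nvidia-smi instead:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | single_driver_version
```

The helper returns a non-zero status when the GPUs report different versions, so it can gate the installation steps in a script.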

Install NVSwitch Configuration and Query (NSCQ) Library

Reference
NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier Generation GPUs.
Notice
The installation commands below require internet access. If internet access is not available, you must download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    Code block. cuda-keyring download command
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    Code block. cuda-keyring installation command
    nvidia-smi --query-gpu driver_version --format csv
    Code block. Check NVIDIA Driver version command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Check NVIDIA Driver version example

  2. Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535
    Code block. NSCQ library apt-cache command
    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
     ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install the version that matches the Driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
     ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NSCQ library apt-cache command result
    apt install libnvidia-nscq-535=535.183.06-1
    Code block. NSCQ library installation command
    Notice

    You must install the same version as the NVIDIA Driver version.

    • Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
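To keep the library pinned to the driver, the package argument for apt can be derived from the driver version instead of being typed by hand. This is a minimal sketch; `nscq_apt_spec` is a hypothetical helper, and the `-1` package revision is an assumption taken from the example above, so confirm it against the `apt-cache policy` output before installing.

```shell
#!/usr/bin/env bash
# nscq_apt_spec (hypothetical helper): builds "package=version" for apt from a
# full driver version such as 535.183.06.
# Assumption: the repository package revision is "-1", as in the example above.
nscq_apt_spec() {
  local version="$1"
  local branch="${version%%.*}"   # 535.183.06 -> 535
  printf 'libnvidia-nscq-%s=%s-1\n' "${branch}" "${version}"
}

# Prints: libnvidia-nscq-535=535.183.06-1
nscq_apt_spec 535.183.06

# Possible use once the driver version is known:
#   apt install "$(nscq_apt_spec 535.183.06)"
```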

Install NVSwitch Device Monitoring API (NVSDM) Library

Reference
NVSDM Library installation is required for Blackwell and later GPU Architectures. NVSDM Library is not provided for NVIDIA Driver version 560 or lower.
  • Install NVSDM library.
    apt-cache policy libnvsdm
    Code block. NVSDM library apt-cache command
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library apt-cache command result
    apt install libnvsdm=580.105.08-1
    Code block. NVSDM library installation
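Because NVSDM is not provided for driver version 560 or lower, a script can check the driver branch before it tries to install libnvsdm. The sketch below covers only that driver condition; `nvsdm_supported` is a hypothetical helper, and it does not check the GPU architecture, which must still be Blackwell or later.

```shell
#!/usr/bin/env bash
# nvsdm_supported (hypothetical helper): succeeds only when the driver branch
# is higher than 560, the condition stated in the Reference note above.
nvsdm_supported() {
  local branch="${1%%.*}"   # 580.105.08 -> 580
  [ "${branch}" -gt 560 ]
}

if nvsdm_supported 580.105.08; then
  echo "driver 580.105.08: libnvsdm packages are expected to be available"
fi
if ! nvsdm_supported 535.183.06; then
  echo "driver 535.183.06: libnvsdm is not provided for this driver"
fi
```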

Install NVIDIA DCGM (for Ubuntu)

Install DCGM Exporter in the following order.

  1. Install datacenter-gpu-manager-exporter
  2. Enable and start DCGM service
  3. For metrics that can be collected with GPU DCGM Exporter and configuration methods, refer to DCGM Exporter Metrics.

Install DCGM (datacenter-gpu-manager)

The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides functions including GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Check CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Check CUDA version result example
    CUDA_VERSION=12
    Code block. Set CUDA version command
  2. Install datacenter-gpu-manager-cuda.
    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation command
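The CUDA major version can also be read from the nvidia-smi header rather than set by hand. The following sketch assumes the header contains the text `CUDA Version: <major>.<minor>` as in the example above; `cuda_major` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# cuda_major (hypothetical helper): reads nvidia-smi output on stdin and
# prints the major part of the first "CUDA Version: X.Y" it finds.
cuda_major() {
  sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Example run against the sample header line shown above.
sample='| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |'
CUDA_VERSION="$(printf '%s\n' "${sample}" | cuda_major)"
echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"

# On a GPU Server the value could be taken from the live output instead:
#   CUDA_VERSION="$(nvidia-smi | cuda_major)"
```

With the sample line, the script prints the package name datacenter-gpu-manager-4-cuda12.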

Install datacenter-gpu-manager-exporter

This is a tool that collects various GPU metrics such as GPU usage, memory usage, temperature, and power consumption based on NVIDIA Data Center GPU Manager (DCGM) and exposes them for use in monitoring systems like Prometheus.

  1. Install datacenter-gpu-manager-exporter.
    apt install datacenter-gpu-manager-exporter
    Code block. datacenter-gpu-manager-exporter installation command
  2. Check DCGM Exporter configuration file.
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. Check datacenter-gpu-manager-exporter configuration file command
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Check datacenter-gpu-manager-exporter configuration file result example
  3. Review the default configuration installed with DCGM Exporter: remove the leading # from metrics you need, and add # in front of metrics you do not need.
    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric configuration example
    Caution
    Metric collection through ServiceWatch Agent is classified as custom metrics, and unlike basic metrics collected from each service, fees are charged. Therefore, unnecessary metric collection should be removed or disabled to avoid excessive charges.
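Since every enabled counter is billed as a custom metric, it can help to list what is currently enabled before starting the service. This is a minimal sketch that assumes the counters file uses the `name, type, description` layout shown in the example; `list_enabled_counters` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# list_enabled_counters (hypothetical helper): prints the metric names in a
# dcgm-exporter counters CSV file that are not commented out with "#".
list_enabled_counters() {
  awk -F, '!/^[ \t]*#/ && NF > 1 { gsub(/[ \t]/, "", $1); print $1 }' "$1"
}

# Print the enabled metrics only if the default file exists on this host.
counters_file=/etc/dcgm-exporter/default-counters.csv
if [ -r "${counters_file}" ]; then
  list_enabled_counters "${counters_file}"
fi
```

Piping the result to `wc -l` gives the number of metrics that will be collected.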

Enable and Start DCGM Service

  1. Enable and start nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. Enable and start nvidia-dcgm service command

  2. Enable and start nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. Enable and start nvidia-dcgm-exporter service command
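After both services start, one way to confirm that metrics are being exposed is to count the DCGM samples returned by the exporter. The sketch below assumes the exporter listens on its default address, port 9400, which the service unit may override; `count_dcgm_samples` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_dcgm_samples (hypothetical helper): reads Prometheus exposition text
# on stdin and prints the number of sample lines whose metric name starts
# with DCGM_FI_. Comment lines (# HELP, # TYPE) are not counted.
count_dcgm_samples() {
  grep -c '^DCGM_FI_'
}

# Possible use on the GPU Server, assuming the default port 9400:
#   systemctl is-active nvidia-dcgm nvidia-dcgm-exporter
#   curl -s http://localhost:9400/metrics | count_dcgm_samples
```

A result of 0 suggests that the exporter is running but every counter in the configuration file is commented out or unsupported on the GPU.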

Notice
If you have completed DCGM Exporter configuration, you must install the Open Telemetry Collector provided by ServiceWatch to complete ServiceWatch Agent configuration.
For details, refer to ServiceWatch > Use ServiceWatch Agent.

Install Prometheus Exporter for GPU Metrics (for RHEL)

Install ServiceWatch Agent to collect GPU Server metrics in the following order.

Check NVIDIA Driver Installation (for RHEL)

  1. Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. Check NVIDIA Driver version command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. Check NVIDIA Driver version example

Install NVSwitch Configuration and Query (NSCQ) Library (for RHEL)

Reference

NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier Generation GPUs.

  • For RHEL, check whether libnvidia-nscq is already installed, and then install it.
Notice
The installation commands below require internet access. If internet access is not available, you must download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ and upload it to the server.
  1. Check libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq
    libnvidia-nscq-535-535.183.06-1.x86_64
    Code block. Check NSCQ library package

  2. Add CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  3. Reset the NVIDIA Driver DNF module status.

    dnf module reset nvidia-driver
    Code block. Reset NVIDIA Driver DNF module status
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Disabling module profiles:
    nvidia-driver/default
    nvidia-driver/fm
    Resetting modules:
    nvidia-driver
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Reset NVIDIA Driver DNF module status result example

  4. Enable NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open
    Code block. Enable NVIDIA Driver module
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Enabling module streams:
    nvidia-driver 535-open
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Enable NVIDIA Driver module result example

  5. Check libnvidia-nscq module list.

    dnf list libnvidia-nscq-535 --showduplicates
    Code block. Check libnvidia-nscq module list

  6. Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
    Code block. libnvidia-nscq installation command

Install NVSwitch Device Monitoring API (NVSDM) Library (for RHEL)

Reference
NVSDM Library installation is required for Blackwell and later GPU Architectures. NVSDM Library is not provided for NVIDIA Driver version 560 or lower.
  1. Check NVSDM library module list.

    dnf list libnvsdm --showduplicates
    Code block. Check NVSDM library module list
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. Check NVSDM library module list result example

  2. Install libnvsdm.

    dnf install libnvsdm-580.105.08-1
    Code block. NVSDM library installation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    ========================================================================
    Package Architecture Version Repository Size
    ========================================================================
    Installing:
    libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
    Installing dependencies:
    infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
    libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
    
    Transaction Summary
    ========================================================================
    Install 3 Packages
    
    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
    Code block. NVSDM library installation command result example

Install NVIDIA DCGM (for RHEL)

Install DCGM Exporter in the following order.

  1. Install datacenter-gpu-manager-exporter
  2. Enable and start DCGM service
  3. For metrics that can be collected with GPU DCGM Exporter and configuration methods, refer to DCGM Exporter Metrics.

Install DCGM (datacenter-gpu-manager) (for RHEL)

The datacenter-gpu-manager-4-cuda12 package is a specific build of NVIDIA’s Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name, cuda12 indicates a build for CUDA 12, and datacenter-gpu-manager-4 indicates DCGM version 4.x. DCGM provides functions including GPU status monitoring, diagnostics, alerting, and power/clock management.

  1. Add CUDA Repository to DNF.
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository
  2. Check CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Check CUDA version result example
    CUDA_VERSION=12
    Code block. Set CUDA version command
  3. Check datacenter-gpu-manager-cuda module list.
    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
    Code block. Check datacenter-gpu-manager-cuda module list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
    Code block. Check datacenter-gpu-manager-cuda module list result example
  4. Install datacenter-gpu-manager-cuda.
    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M
    
    Transaction Summary
    ====================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
    Code block. datacenter-gpu-manager-cuda installation result example
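The version list can be long, so a script may want to pick out the newest entry from the `dnf list --showduplicates` output. The sketch below is an approximation that assumes the three-column output layout shown above and GNU `sort -V`, which is not a full RPM version comparison; `latest_package_version` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# latest_package_version (hypothetical helper): reads `dnf list` output on
# stdin and prints the highest version listed for the package named in $1.
# Assumption: GNU sort -V ordering is close enough to RPM ordering here.
latest_package_version() {
  awk -v pkg="$1" 'index($1, pkg ".") == 1 { print $2 }' | sort -V | tail -n 1
}

# Example run against three lines taken from the sample output above.
latest_package_version datacenter-gpu-manager-4-cuda12 <<'EOF'
Available Packages
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
EOF
```

For these sample lines the helper prints 1:4.4.2-1, the same version that the installation result example resolves to.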

Install datacenter-gpu-manager-exporter (for RHEL)

This is a tool that collects various GPU metrics such as GPU usage, memory usage, temperature, and power consumption based on NVIDIA Data Center GPU Manager (DCGM) and exposes them for use in monitoring systems like Prometheus.

  1. Add CUDA Repository to DNF. (If you have already run this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  2. Check CUDA version. (If you have already run this command, proceed to the next step.)

    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Check CUDA version result example
    CUDA_VERSION=12
    Code block. Set CUDA version command

  3. Check datacenter-gpu-manager-exporter module list.

    dnf list datacenter-gpu-manager-exporter --showduplicates
    Code block. Check datacenter-gpu-manager-exporter module list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64
    Code block. Check datacenter-gpu-manager-exporter module list result example

  4. Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.x requires glibc 2.34 or higher, which is provided only from RHEL 9 onward, so on RHEL 8 specify version 4.1.3-1 to install.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1
    Code block. datacenter-gpu-manager-exporter installation
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ====================================================================================================
     Package                                       Architecture   Version     Repository          Size
    ====================================================================================================
    Installing:
     datacenter-gpu-manager-exporter               x86_64         4.1.3-1     cuda-rhel8-x86_64   26 M
    
    ...
    Is this ok [y/N]: y
    Code block. datacenter-gpu-manager-exporter installation result example
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. Check datacenter-gpu-manager-exporter configuration file result example

  5. Review the default configuration file installed with DCGM Exporter. Remove the leading # from the metrics you need, and add # in front of the metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric configuration example
    Reference
    For the metrics that DCGM Exporter can collect and how to configure them, refer to DCGM Exporter Metrics.
    Caution
    Metric collection through ServiceWatch Agent is classified as custom metrics, and unlike basic metrics collected from each service, fees are charged. Therefore, unnecessary metric collection should be removed or disabled to avoid excessive charges.
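As an alternative to editing the file by hand in vi, the comment toggling described in step 5 can be sketched as two small shell functions. This is a minimal sketch, not part of the official procedure: the function names disable_metric and enable_metric are hypothetical, GNU sed is assumed, and the default path is the one shown in the ExecStart line above.

```shell
# Sketch only: toggle the leading '#' for one DCGM field in a counters CSV.
# Assumptions: GNU sed (-i, -E); function names are illustrative, not part of DCGM Exporter.
COUNTERS_FILE="${COUNTERS_FILE:-/etc/dcgm-exporter/default-counters.csv}"

# Comment out an active metric line.
# $1 = DCGM field name, $2 = CSV path (optional)
disable_metric() {
    sed -i -E "s/^[[:space:]]*($1[[:space:]]*,)/# \1/" "${2:-$COUNTERS_FILE}"
}

# Uncomment a disabled metric line.
# $1 = DCGM field name, $2 = CSV path (optional)
enable_metric() {
    sed -i -E "s/^[[:space:]]*#[[:space:]]*($1[[:space:]]*,)/\1/" "${2:-$COUNTERS_FILE}"
}
```

For example, `disable_metric DCGM_FI_PROF_DRAM_ACTIVE` would comment out that line in the default file. If the nvidia-dcgm-exporter service is already running, restarting it afterward makes it reread the file.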

Enable and Start DCGM Service (for RHEL)

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. Enable and start nvidia-dcgm service command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. Enable and start nvidia-dcgm-exporter service command
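Once both services are started, it can be useful to confirm that the exporter is actually serving the metrics you enabled. The following is a minimal sketch, not an official verification step: the helper name list_dcgm_metrics is hypothetical, and the endpoint http://localhost:9400/metrics assumes DCGM Exporter is listening on its default port, which may differ if the service was started with a custom address.

```shell
# Sketch only: print each distinct DCGM_* metric name found in
# Prometheus text-format output read from stdin.
# The function name is illustrative, not part of DCGM Exporter.
list_dcgm_metrics() {
    # Keep sample lines (not '# HELP' / '# TYPE'), strip labels and values, de-duplicate.
    grep -E '^DCGM_' | sed -E 's/[{ ].*$//' | sort -u
}

# Example usage on the GPU Server (assumes the default port 9400):
#   systemctl is-active nvidia-dcgm nvidia-dcgm-exporter
#   curl -s http://localhost:9400/metrics | list_dcgm_metrics
```

Because every exposed metric is billed as a custom metric, comparing this list with what you intended to collect is a quick way to spot entries that should be disabled.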

Notice
After you complete the DCGM Exporter configuration, install the Open Telemetry Collector provided by ServiceWatch to finish configuring ServiceWatch Agent.
For details, refer to ServiceWatch > Use ServiceWatch Agent.

DCGM Exporter Metrics

DCGM Exporter Key Metrics

The key GPU metrics provided by DCGM Exporter are as follows.

Category | DCGM Field | Prometheus Metric Type | Summary
Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz)
Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz)
Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C)
Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W)
Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %)
Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %)
Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB)
Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB)
NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes
Table. Key GPU metrics provided by DCGM Exporter

DCGM Exporter Metrics Collection Configuration

For the metrics configured by default in DCGM Exporter, refer to DCGM Exporter > Default Metrics.

  • To collect metrics beyond the defaults, remove the leading # from the corresponding line in default-counters.csv.
  • To stop collecting a default metric, add # at the start of its line or delete the line.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...
Code block. default-counters.csv configuration example
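To review which fields a counters file currently enables before restarting the exporter, the comment convention above can be read back with grep and sed. This is a minimal sketch under stated assumptions: the function names enabled_metrics and disabled_metrics are hypothetical, and the default path is the one used elsewhere on this page.

```shell
# Sketch only: summarize a DCGM Exporter counters CSV.
# Function names are illustrative; the default path matches the one used in this guide.
DEFAULT_COUNTERS="/etc/dcgm-exporter/default-counters.csv"

# Fields that will be collected (lines not starting with '#').
enabled_metrics() {
    grep -E '^[[:space:]]*DCGM_[A-Z0-9_]+[[:space:]]*,' "${1:-$DEFAULT_COUNTERS}" |
        sed -E 's/^[[:space:]]*//; s/[[:space:]]*,.*$//'
}

# Fields that are present but commented out with '#'.
disabled_metrics() {
    grep -E '^[[:space:]]*#[[:space:]]*DCGM_[A-Z0-9_]+[[:space:]]*,' "${1:-$DEFAULT_COUNTERS}" |
        sed -E 's/^[[:space:]]*#[[:space:]]*//; s/[[:space:]]*,.*$//'
}
```

Running `enabled_metrics | wc -l` gives a rough count of the custom metrics that will be collected per GPU, which helps when estimating charges.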