The page has been translated by Gen AI.

ServiceWatch Agent Install

Users can install the ServiceWatch Agent on the GPU Server to collect custom metrics and logs.

Reference
Custom metric and log collection via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. Support for other offerings is planned.
Caution
Metrics collected via the ServiceWatch Agent are classified as custom metrics and, unlike the default metrics collected from each service, incur charges. It is recommended to remove or disable any unnecessary metric collection settings.

ServiceWatch Agent

Two types of agents must be installed on the GPU Server to collect ServiceWatch custom metrics and logs: Prometheus Exporter and Open Telemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape
  • For collecting server OS metrics, you can use Node Exporter for Linux servers and Windows Exporter for Windows servers, depending on the OS type.
Open Telemetry Collector | Collects telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and acts as a centralized collector that exports to various backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Description of Prometheus Exporter and Open Telemetry Collector
Caution

If you have configured Kubernetes Engine on a GPU Server, please check GPU metrics through the metrics provided by Kubernetes Engine.

  • If you install the DCGM Exporter on a GPU server configured with Kubernetes Engine, it may not work properly.

Install Prometheus Exporter for GPU metrics (for Ubuntu)

Install the Prometheus Exporter to collect metrics of the GPU Server according to the steps below.

NVIDIA Driver Installation Check

  • Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example
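
The driver version reported above is reused in later steps to select matching library packages. The following is a minimal sketch of how that value could be extracted in a script. The function names driver_branch and unique_driver_versions are hypothetical helpers, and the sample input is copied from the example output above rather than taken from a live query.

```shell
#!/bin/sh
# Hypothetical helper: print the major branch (e.g. 535) of a driver
# version string such as 535.183.06.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: read CSV output in the format shown above and print
# the distinct driver versions. The "driver_version" header is skipped.
unique_driver_versions() {
    grep -v '^driver_version' | sed '/^[[:space:]]*$/d' | sort -u
}

# Sample input copied from the example output above (not a live query).
sample='driver_version
535.183.06
535.183.06'

version=$(printf '%s\n' "$sample" | unique_driver_versions)
echo "driver version: $version, branch: $(driver_branch "$version")"
```

On a GPU Server, the sample could be replaced by piping the output of the nvidia-smi command above into unique_driver_versions. If more than one version is printed, the GPUs do not all report the same driver version.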

NVSwitch Configuration and Query (NSCQ) Library Installation

Reference
NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier Generation GPUs.
Notice
The installation commands below require internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ separately and upload it to the GPU Server.
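
As a rough illustration of how the <distro>/<arch> part of the repository path is formed for Ubuntu, the sketch below builds the directory URL from a release number. The function name cuda_repo_url is hypothetical, and the ubuntu<release without dot> naming is an assumption inferred from the ubuntu2204/x86_64 path that appears in the apt-cache results on this page. Confirm the actual directory name in the repository listing before downloading.

```shell
#!/bin/sh
# Hypothetical helper: build the CUDA repository directory URL for Ubuntu.
#   $1: Ubuntu release number, e.g. 22.04
#   $2: repository architecture directory, e.g. x86_64
cuda_repo_url() {
    # Assumption: the directory name is "ubuntu" plus the release without the dot.
    distro="ubuntu$(printf '%s' "$1" | tr -d '.')"
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/%s/\n' "$distro" "$2"
}

cuda_repo_url 22.04 x86_64
```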
  1. Install cuda-keyring.

    wget https://developer.download.nvidia.com/compute/cuda/repos/<distro>/<arch>/cuda-keyring_1.1-1_all.deb
    Code block. cuda-keyring download command
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt update
    Code block. cuda-keyring installation command
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example

  2. Install libnvidia-nscq.

    apt-cache policy libnvidia-nscq-535
    Code block. NSCQ library apt-cache command
    libnvidia-nscq-535:
      Installed: (none)
      Candidate: 535.247.01-1
      Version table:
         535.247.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.216.01-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.06-1 600  # Install version matching the Driver
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.183.01-1 600
    ...
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         535.54.03-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NSCQ library apt-cache command result
    apt install libnvidia-nscq-535=535.183.06-1
    Code block. NSCQ library installation command

Notice

Install the libnvidia-nscq version that matches the NVIDIA Driver version.

  • Example) driver version: 535.183.06, libnvidia-nscq version: 535.183.06-1
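
The version-matching rule above can be sketched as a small helper that derives the apt package specification from the driver version. The function name nscq_apt_spec is hypothetical, and the package revision suffix -1 is an assumption taken from the example above. Check the apt-cache policy result for the revision that is actually available.

```shell
#!/bin/sh
# Hypothetical helper: build the apt package spec for libnvidia-nscq.
#   $1: driver version, e.g. 535.183.06
#   $2: package revision (assumed to be 1 when omitted)
nscq_apt_spec() {
    driver_version=$1
    revision=${2:-1}
    branch=${driver_version%%.*}
    printf 'libnvidia-nscq-%s=%s-%s\n' "$branch" "$driver_version" "$revision"
}

# Prints the same spec used in the installation command above.
nscq_apt_spec 535.183.06
```

The result could then be passed to apt install, for example apt install "$(nscq_apt_spec 535.183.06)".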

NVSwitch Device Monitoring API (NVSDM) Library Installation

Note
The NVSDM Library must be installed for GPUs of the Blackwell architecture and later. NVIDIA Driver versions 560 and below do not provide the NVSDM Library.
  • Install the NVSDM library.
    apt-cache policy libnvsdm
    Code block. NVSDM library apt-cache command
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library apt-cache command result
    apt install libnvsdm=580.105.08-1
    Code block. NVSDM library installation
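
The driver constraint in the note above (versions 560 and below do not provide the NVSDM Library) could be checked before attempting the installation. This is a minimal sketch. The function name nvsdm_available is hypothetical, and it encodes only the driver branch condition stated in this section, not the GPU architecture.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the driver branch is above 560, that is,
# when the driver version is one for which the NVSDM Library is provided.
nvsdm_available() {
    branch=${1%%.*}
    [ "$branch" -gt 560 ]
}

# Driver versions taken from the examples on this page.
for v in 535.183.06 580.105.08; do
    if nvsdm_available "$v"; then
        echo "$v: libnvsdm can be installed"
    else
        echo "$v: libnvsdm is not provided for this driver"
    fi
done
```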

NVIDIA DCGM Installation (for Ubuntu)

Install the DCGM Exporter according to the following steps.

  1. DCGM(datacenter-gpu-manager) Installation
  2. datacenter-gpu-manager-exporter Installation
  3. DCGM Service Activation and Start

DCGM(datacenter-gpu-manager) Installation

datacenter-gpu-manager is the package for NVIDIA Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name datacenter-gpu-manager-4-cuda12, datacenter-gpu-manager-4 denotes the 4.x version of DCGM and cuda12 indicates the build for CUDA version 12. DCGM provides features including GPU status monitoring, diagnostics, an alert system, and power/clock management.

  1. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  2. Install datacenter-gpu-manager-cuda.
    apt install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation command
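
In the steps above, CUDA_VERSION is set by hand to the major version shown by nvidia-smi. As a sketch, the value could also be derived from that output. The function name cuda_major is hypothetical, and the sample line is copied from the example result above rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: read nvidia-smi output on stdin and print the major
# CUDA version (the part before the dot in "CUDA Version: 12.2").
cuda_major() {
    sed -n 's/.*CUDA Version: \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Sample line copied from the example result above.
sample='| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |'

CUDA_VERSION=$(printf '%s\n' "$sample" | cuda_major)
echo "datacenter-gpu-manager-4-cuda${CUDA_VERSION}"
```

On a GPU Server, the assignment could read CUDA_VERSION=$(nvidia-smi | cuda_major) instead of using the sample line.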

datacenter-gpu-manager-exporter installation

datacenter-gpu-manager-exporter (DCGM Exporter) is a tool based on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them for use in monitoring systems like Prometheus.

  1. Install datacenter-gpu-manager-exporter.
    apt install datacenter-gpu-manager-exporter
    Code block. datacenter-gpu-manager-exporter installation command
  2. Check the DCGM Exporter configuration file.
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. datacenter-gpu-manager-exporter configuration file verification command
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. datacenter-gpu-manager-exporter configuration file check result example
  3. Review the metric settings provided with the DCGM Exporter installation. Remove the leading # from the metrics you need, and add # in front of the metrics you do not need.
    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric setting example
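
Because every enabled metric is collected as a custom metric, it can help to list which counters are currently active after editing the file. The following is a minimal sketch. The function name enabled_metrics is hypothetical, and the temporary file stands in for /etc/dcgm-exporter/default-counters.csv using the lines from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the names of counters that are not commented out.
#   $1: path to a counters CSV file
enabled_metrics() {
    grep -v '^[[:space:]]*#' "$1" \
        | sed '/^[[:space:]]*$/d' \
        | cut -d',' -f1 \
        | sed 's/[[:space:]]*$//'
}

# Stand-in for the real counters file, using the lines from the example above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
EOF

enabled=$(enabled_metrics "$tmp")
echo "$enabled"
rm -f "$tmp"
```

With the sample file, only the two uncommented counters are listed. On a GPU Server, the function could be called with the path of the counters file shown in the ExecStart line above.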
Reference
For the metrics that can be collected with GPU DCGM Exporter and how to configure them, see DCGM Exporter metrics.
Caution
Since metric collection via ServiceWatch Agent is classified as custom metrics and incurs charges unlike the default collected metrics, unnecessary metric collection must be removed or disabled to avoid excessive charges.

DCGM Service activation and start

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command
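
After both services are started, one way to confirm that the exporter is producing data is to count the DCGM samples in its Prometheus-format output. The sketch below is illustrative only. The function name count_dcgm_samples is hypothetical, and the sample text is made-up data in Prometheus exposition format, not output captured from a server.

```shell
#!/bin/sh
# Hypothetical helper: count metric sample lines that start with DCGM_ in
# Prometheus-format text read from stdin (comment lines start with # and
# are therefore not counted).
count_dcgm_samples() {
    grep -c '^DCGM_'
}

# Illustrative sample only; the values are made up.
sample='# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 43'

samples=$(printf '%s\n' "$sample" | count_dcgm_samples)
echo "DCGM samples: $samples"
```

On a GPU Server, the input could come from the exporter endpoint, for example curl -s http://localhost:9400/metrics | count_dcgm_samples. The address localhost:9400 is an assumption based on the usual dcgm-exporter default and is not stated on this page.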

Notice
After completing the DCGM Exporter setup, install the Open Telemetry Collector provided by ServiceWatch to complete the ServiceWatch Agent configuration.
For more details, see ServiceWatch > Using ServiceWatch Agent.

Installation of Prometheus Exporter for GPU metrics (for RHEL)

Install the ServiceWatch Agent according to the steps below to collect metrics of the GPU Server.

NVIDIA Driver Installation Check (for RHEL)

  1. Check the installed NVIDIA Driver.
    nvidia-smi --query-gpu driver_version --format csv
    Code block. NVIDIA Driver version check command
    driver_version
    535.183.06
    ...
    535.183.06
    Code block. NVIDIA Driver version check example

NVSwitch Configuration and Query (NSCQ) Library installation (for RHEL)

Reference

NVSwitch Configuration and Query (NSCQ) Library is required for Hopper or earlier Generation GPUs.

  • For RHEL, check whether libnvidia-nscq is already installed, and install it if it is not.
Notice
The installation commands below require internet access. In an environment without internet access, download libnvidia-nscq from https://developer.download.nvidia.com/compute/cuda/repos/ separately and upload it to the GPU Server.
  1. Check the libnvidia-nscq package.

    rpm -qa | grep libnvidia-nscq
    libnvidia-nscq-535-535.183.06-1.x86_64
    Code block. NSCQ library package check command and example result

  2. Add CUDA Repository to DNF.

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  3. Reset the NVIDIA Driver DNF module state.

    dnf module reset nvidia-driver
    Code block. NVIDIA Driver DNF module state initialization
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:03:15 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    Package Architecture Version Repository Size
    =============================================
    Disabling module profiles:
    nvidia-driver/default
    nvidia-driver/fm
    Resetting modules:
    nvidia-driver
    
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. Example of the result of state initialization of the NVIDIA Driver DNF module

  4. Enable the NVIDIA Driver module.

    dnf module enable nvidia-driver:535-open
    Code block. NVIDIA Driver module activation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:04:22 ago on Wed 19 Nov 2025 01:23:48 AM EST.
    Dependencies resolved.
    =============================================
    
    Package Architecture Version Repository Size
    =============================================
    Enabling module streams:
    nvidia-driver 535-open
      
    Transaction Summary
    =============================================
    
    Is this ok [y/N]: y
    Code block. NVIDIA Driver module activation result example

  5. Check the libnvidia-nscq package list.

    dnf list libnvidia-nscq-535 --showduplicates
    Code block. libnvidia-nscq package list check

  6. Install libnvidia-nscq.

    dnf install libnvidia-nscq-535-535.183.06-1
    Code block. libnvidia-nscq installation command
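
The check-then-install flow in the steps above can be sketched as follows. The function names nscq_rpm_name and nscq_installed_for are hypothetical, the revision suffix -1 is an assumption taken from the package name used above, and the sample package list is copied from the rpm -qa example rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: build the RPM name-version-release for libnvidia-nscq.
#   $1: driver version, e.g. 535.183.06
#   $2: package revision (assumed to be 1 when omitted)
nscq_rpm_name() {
    driver_version=$1
    revision=${2:-1}
    printf 'libnvidia-nscq-%s-%s-%s\n' "${driver_version%%.*}" "$driver_version" "$revision"
}

# Hypothetical helper: succeed when the package list on stdin (rpm -qa format)
# contains the libnvidia-nscq package matching the given driver version.
nscq_installed_for() {
    grep -qF "$(nscq_rpm_name "$1")."
}

# Sample list copied from the rpm -qa example above.
sample_rpm_list='libnvidia-nscq-535-535.183.06-1.x86_64'

if printf '%s\n' "$sample_rpm_list" | nscq_installed_for 535.183.06; then
    echo "matching libnvidia-nscq is already installed"
else
    echo "run: dnf install $(nscq_rpm_name 535.183.06)"
fi
```

On a RHEL server, the sample list could be replaced with the output of rpm -qa.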

NVSwitch Device Monitoring API (NVSDM) Library Installation (for RHEL)

Reference
The NVSDM Library must be installed for GPUs of the Blackwell architecture and later. NVIDIA Driver versions 560 and below do not provide the NVSDM Library.
  1. Check the NVSDM library module list.

    dnf list libnvsdm --showduplicates
    Code block. Check NVSDM library module list
    libnvsdm:
      Installed: (none)
      Candidate: 580.105.08-1
      Version table:
         580.105.08-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.95.05-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.82.07-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
         580.65.06-1 600
            600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages
    Code block. NVSDM library module list verification result example

  2. Install libnvsdm.

    dnf install libnvsdm-580.105.08-1
    Code block. NVSDM library installation
    Updating Subscription Management repositories.
    Last metadata expiration check: 0:08:18 ago on Wed 19 Nov 2025 01:05:28 AM EST.
    Dependencies resolved.
    =========================================================================
    
    Package Architecture Version Repository Size
    =========================================================================
    
    Installing:
    libnvsdm x86_64 580.105.08-1 cuda-rhel8-x86_64 675 k
    Installing dependencies:
    infiniband-diags x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 323 k
    libibumad x86_64 48.0-1.el8 rhel-8-for-x86_64-baseos-rpms 34 k
    
    Transaction Summary
    =========================================================================
    
    Install 3 Packages
    
    Total download size: 1.0 M
    Installed size: 3.2 M
    Is this ok [y/N]: y
    Code block. NVSDM library installation command result example

NVIDIA DCGM Installation (for RHEL)

Install the DCGM Exporter according to the steps below.

  1. DCGM(datacenter-gpu-manager) Installation
  2. datacenter-gpu-manager-exporter installation
  3. DCGM Service Activation and Start

DCGM(datacenter-gpu-manager) Installation (for RHEL)

datacenter-gpu-manager is the package for NVIDIA Data Center GPU Manager (DCGM), a tool for managing and monitoring NVIDIA data center GPUs. In the package name datacenter-gpu-manager-4-cuda12, datacenter-gpu-manager-4 denotes the 4.x version of DCGM and cuda12 indicates the build for CUDA version 12. DCGM provides functions including GPU status monitoring, diagnostics, an alert system, and power/clock management.

  1. Add CUDA Repository to DNF.
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository
  2. Check the CUDA version.
    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command
  3. Check the list of datacenter-gpu-manager-cuda modules.
    dnf list datacenter-gpu-manager-4-cuda${CUDA_VERSION} --showduplicates
    Code block. Check datacenter-gpu-manager-cuda module list
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:00:34 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.1.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.2-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.0-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.1-1    cuda-rhel8-x86_64
    datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
    Code block. datacenter-gpu-manager-cuda module list check result example
  4. Install datacenter-gpu-manager-cuda.
    dnf install datacenter-gpu-manager-4-cuda${CUDA_VERSION}
    Code block. datacenter-gpu-manager-cuda installation
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:07:12 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Dependencies resolved.
    ===================================================================================================
    Package                                       Architecture   Version     Repository          Size
    ===================================================================================================
    Installing:
     datacenter-gpu-manager-4-cuda12               x86_64         1:4.4.2-1   cuda-rhel8-x86_64   554 M
    Installing dependencies:
     datacenter-gpu-manager-4-core                 x86_64         1:4.4.2-1   cuda-rhel8-x86_64   9.9 M
    Installing weak dependencies:
     datacenter-gpu-manager-4-proprietary          x86_64         1:4.4.2-1   cuda-rhel8-x86_64   5.3 M
     datacenter-gpu-manager-4-proprietary-cuda12   x86_64         1:4.4.2-1   cuda-rhel8-x86_64   289 M
    
    Transaction Summary
    ====================================================================================================
    Install  4 Packages
    ...
    Is this ok [y/N]: y
    Code block. datacenter-gpu-manager-cuda installation result example
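
When a specific version needs to be pinned, the newest entry in a dnf list --showduplicates result can be picked with a version sort. The following is a minimal sketch. The function name latest_listed_version is hypothetical, the sample lines are a subset of the list result shown above, and sort -V assumes GNU coreutils (or another sort that supports version ordering).

```shell
#!/bin/sh
# Hypothetical helper: read "dnf list --showduplicates" output on stdin and
# print the highest version listed for packages whose name starts with $1.
latest_listed_version() {
    awk -v pkg="$1" 'index($1, pkg) == 1 { print $2 }' | sort -V | tail -n 1
}

# Subset of the list result shown above, in shuffled order.
sample='Available Packages
datacenter-gpu-manager-4-cuda12.x86_64   1:4.0.0-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.4.2-1    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.2.3-2    cuda-rhel8-x86_64
datacenter-gpu-manager-4-cuda12.x86_64   1:4.3.1-1    cuda-rhel8-x86_64'

latest=$(printf '%s\n' "$sample" | latest_listed_version datacenter-gpu-manager-4-cuda12)
echo "latest listed version: $latest"
```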

datacenter-gpu-manager-exporter installation (for RHEL)

datacenter-gpu-manager-exporter (DCGM Exporter) is a tool based on NVIDIA Data Center GPU Manager (DCGM) that collects GPU metrics such as GPU usage, memory usage, temperature, and power consumption, and exposes them for use in monitoring systems like Prometheus.

  1. Add the CUDA Repository to DNF. (If you have already performed this command, proceed to the next step.)

    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    Code block. Add DNF Repository

  2. Check the CUDA version. (If you have already performed this command, proceed to the next step.)

    nvidia-smi | grep CUDA
    Code block. Check CUDA version
    | NVIDIA-SMI 535.183.06             Driver Version: 535.183.06     CUDA Version: 12.2     |
    Code block. Example of CUDA version check result
    CUDA_VERSION=12
    Code block. CUDA version setting command

  3. Check the list of datacenter-gpu-manager-exporter modules.

    dnf list datacenter-gpu-manager-exporter --showduplicates
    Code block. datacenter-gpu-manager-exporter module list check
    Updating Subscription Management repositories.
    Unable to read consumer identity
    
    This system is not registered with an entitlement server. You can use subscription-manager to register.
    
    Last metadata expiration check: 0:02:11 ago on Wed 19 Nov 2025 12:26:56 AM EST.
    Available Packages
    datacenter-gpu-manager-exporter.x86_64   4.0.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.1.3-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.0-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.1-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.5.2-1   cuda-rhel8-x86_64
    datacenter-gpu-manager-exporter.x86_64   4.6.0-1   cuda-rhel8-x86_64
    Code block. datacenter-gpu-manager-exporter module list check result example

  4. Install datacenter-gpu-manager-exporter. dcgm-exporter 4.5.X requires glibc 2.34 or higher, which is provided starting with RHEL 9. Because this example uses the RHEL 8 repository, version 4.1.3-1 is specified for installation.

    dnf install datacenter-gpu-manager-exporter-4.1.3-1
    Code block. datacenter-gpu-manager-exporter installation
    Code block. datacenter-gpu-manager-exporter installation result example
    cat /usr/lib/systemd/system/nvidia-dcgm-exporter.service | grep ExecStart
    Code block. datacenter-gpu-manager-exporter configuration file
    ExecStart=/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/default-counters.csv
    Code block. datacenter-gpu-manager-exporter configuration file check result example

  5. Review the default settings installed with DCGM Exporter. Remove the leading # from the metrics you need, and add # in front of the metrics you do not need.

    vi /etc/dcgm-exporter/default-counters.csv
    ## Example ##
    ...
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data.
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active.
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
    ...
    Code block. datacenter-gpu-manager-exporter metric configuration example
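
The glibc constraint described in step 4 can be checked on the host before choosing a package version. The following is a minimal sketch, not an official procedure: the function names are hypothetical, the 2.34 threshold and the 4.1.3-1 pin come from step 4, and installing the unpinned package on hosts with glibc 2.34 or later is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of DCGM or ServiceWatch.

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on the version sort (-V) of GNU sort.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# recommend_exporter_package GLIBC_VERSION: print the package spec to pass to dnf.
# Below glibc 2.34 it prints the pinned 4.1.3-1 build used in step 4;
# otherwise it prints the unpinned package name (an assumption, see lead-in).
recommend_exporter_package() {
  if version_ge "$1" "2.34"; then
    echo "datacenter-gpu-manager-exporter"
  else
    echo "datacenter-gpu-manager-exporter-4.1.3-1"
  fi
}

# detect_glibc_version: print the glibc version of this host (glibc systems only).
detect_glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}
```

With these definitions loaded, `dnf install "$(recommend_exporter_package "$(detect_glibc_version)")"` would select the package for the current host.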

Reference
For the metrics that can be collected with GPU DCGM Exporter and how to configure them, see DCGM Exporter Metrics.
Caution
Metrics collected via the ServiceWatch Agent are classified as custom metrics and, unlike the metrics collected by default, incur charges. Remove or disable unnecessary metric collection to avoid excessive charges.

DCGM Service Activation and Start (for RHEL)

  1. Enable and start the nvidia-dcgm service.

    systemctl enable --now nvidia-dcgm
    Code block. nvidia-dcgm service activation and start command

  2. Enable and start the nvidia-dcgm-exporter service.

    systemctl enable --now nvidia-dcgm-exporter
    Code block. nvidia-dcgm-exporter service activation and start command
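
After both services are started, it can be useful to confirm that they are running and that metrics are actually being exported. The sketch below is one possible check, not part of the official procedure. It assumes dcgm-exporter is listening on its default port 9400 on localhost; the function names and the METRICS_URL variable are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: dcgm-exporter serves metrics on its default port 9400.
METRICS_URL="${METRICS_URL:-http://localhost:9400/metrics}"

# list_metric_names: read Prometheus text format on stdin and print the
# distinct metric names, skipping '# HELP' / '# TYPE' lines and blank lines.
list_metric_names() {
  grep -v -e '^#' -e '^[[:space:]]*$' | sed -E 's/[{ ].*$//' | sort -u
}

# check_exporter: print the state of both services, then list the metric
# names currently exposed by the exporter.
check_exporter() {
  systemctl is-active nvidia-dcgm nvidia-dcgm-exporter
  curl -fsS "$METRICS_URL" | list_metric_names
}
```

Running `check_exporter` on the GPU Server prints one state line per service followed by the exported metric names, which can be compared against the entries enabled in default-counters.csv.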

Notice
After completing the DCGM Exporter setup, install the Open Telemetry Collector provided by ServiceWatch to finish the ServiceWatch Agent configuration.
For more details, see ServiceWatch > ServiceWatch Agent Using.

DCGM Exporter Metrics

DCGM Exporter Key Metrics

Among the metrics provided by DCGM Exporter, the main GPU metrics are as follows.

Category | DCGM Field | Prometheus Metric Type | Summary
Clocks | DCGM_FI_DEV_SM_CLOCK | gauge | SM clock frequency (in MHz)
Clocks | DCGM_FI_DEV_MEM_CLOCK | gauge | Memory clock frequency (in MHz)
Temperature | DCGM_FI_DEV_GPU_TEMP | gauge | GPU temperature (in C)
Power | DCGM_FI_DEV_POWER_USAGE | gauge | Power draw (in W)
Utilization | DCGM_FI_DEV_GPU_UTIL | gauge | GPU utilization (in %)
Utilization | DCGM_FI_DEV_MEM_COPY_UTIL | gauge | Memory utilization (in %)
Memory Usage | DCGM_FI_DEV_FB_FREE | gauge | Frame buffer memory free (in MiB)
Memory Usage | DCGM_FI_DEV_FB_USED | gauge | Frame buffer memory used (in MiB)
NVLink | DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (8 GPU only) | counter | Total number of NVLink bandwidth counters for all lanes
Table. Major GPU metrics provided by DCGM Exporter

DCGM Exporter Metric Collection Settings

For the default metrics of DCGM Exporter, see DCGM Exporter > Default Metrics.

  • To collect a metric that is not enabled by default, remove the leading # from its line in default-counters.csv.
  • To stop collecting a default metric, add # at the start of its line or delete the line.
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message

# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE
# DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX via NVML.
# DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX via NVML.
...
Code block. default-counters.csv setting example
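
Instead of editing the file by hand, the # toggling described above can be scripted. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, each metric is assumed to sit on its own line starting with the DCGM field name followed by a comma (as in the example above), and `sed -i.bak` is used so that a backup copy with the .bak suffix is kept next to the edited file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not shipped with DCGM Exporter.

# enable_metric FILE FIELD: remove the leading '#' from FIELD's line.
enable_metric() {
  sed -i.bak -E "s/^#[[:space:]]*($2,)/\1/" "$1"
}

# disable_metric FILE FIELD: prefix FIELD's line with '# ' if it is active.
# Lines that are already commented out do not match and stay unchanged.
disable_metric() {
  sed -i.bak -E "s/^($2,)/# \1/" "$1"
}
```

For example, `enable_metric /etc/dcgm-exporter/default-counters.csv DCGM_FI_PROF_PCIE_TX_BYTES` would uncomment the PCIe TX counter. Restarting the exporter afterwards with `systemctl restart nvidia-dcgm-exporter` is likely needed for the change to take effect, since the file is passed to the exporter at startup via the -f option.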