
ServiceWatch Agent Install

Users can install the ServiceWatch Agent on the GPU nodes of a Multi-node GPU Cluster to collect custom metrics and logs.

Reference
Collecting custom metrics and logs via the ServiceWatch Agent is currently available only on Samsung Cloud Platform For Enterprise. It will be made available in other offerings in the future.
Caution
Metrics collected via the ServiceWatch Agent are classified as custom metrics and, unlike the metrics collected by default, incur charges. It is recommended to remove or disable any unnecessary metric collection settings.

ServiceWatch Agent

The agents installed on the GPU nodes of a Multi-node GPU Cluster to collect ServiceWatch custom metrics and logs fall into two broad types: Prometheus Exporter and Open Telemetry Collector.

Category | Detailed description
Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape
  • To collect OS metrics of a GPU Node, use Node Exporter for Linux servers or Windows Exporter for Windows servers, depending on the OS type.
Open Telemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes it (filtering, sampling, etc.), and then exports it to various backends (e.g., Prometheus, Jaeger, Elasticsearch)
  • Exports data to the ServiceWatch Gateway so that ServiceWatch can collect metric and log data.
Table. Description of Prometheus Exporter and Open Telemetry Collector
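
As a rough illustration of the "format that Prometheus can scrape" mentioned in the table, the following Python sketch parses a small piece of Prometheus text exposition output and keeps only an allowlisted set of metric names, in the spirit of the caution about unnecessary custom metrics. The sample text, its values, and the function names parse_samples and keep_allowed are assumptions made up for this example. They are not part of the ServiceWatch Agent, and in practice any filtering would be configured in the collector rather than written by hand.

```python
import re

# Hypothetical sample of what an exporter such as Node Exporter might serve
# on its metrics endpoint. The values are made up for illustration.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.0e+09
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_cpu_seconds_total{cpu="0",mode="user"} 67.8
"""

# metric_name{label="value",...} value [timestamp]
LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def keep_allowed(samples, allowed_names):
    """Keep only samples whose metric name is in the allowlist."""
    return [sample for sample in samples if sample[0] in allowed_names]


if __name__ == "__main__":
    for name, labels, value in keep_allowed(parse_samples(SAMPLE), {"node_load1"}):
        print(name, labels, value)
```

The parser is deliberately simplified: it does not unescape label values and silently skips lines it cannot match. It is meant only to show the shape of the data an exporter exposes, not to replace a real Prometheus client library.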
Notice

If you have configured Kubernetes Engine on a GPU Node, use the metrics provided by Kubernetes Engine to check GPU metrics.

  • If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, the exporter may not work properly.
Reference
The ServiceWatch Agent guide for collecting GPU metrics on a GPU Node is the same as the one for a GPU Server. For details, see GPU Server > ServiceWatch Agent.