The page has been translated by Gen AI.
Monitoring Metrics
Multi-node GPU Cluster monitoring metrics
The following tables list the Multi-node GPU Cluster monitoring metrics that can be checked through Cloud Monitoring.
Guide
To view the monitoring metrics of Multi-node GPU Cluster, you must install the Agent by following the guide. Install the Agent to ensure stable use of the service. For the Agent installation procedure and details on using Cloud Monitoring, refer to the Cloud Monitoring guide.
Multi-node GPU Cluster [Cluster]
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Memory Total [Basic] | Available memory bytes | bytes |
| Memory Used [Basic] | Currently used memory bytes | bytes |
| Memory Swap In [Basic] | Swapped-in memory bytes | bytes |
| Memory Swap Out [Basic] | Swapped out memory bytes | bytes |
| Memory Free [Basic] | Unused memory bytes | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of Read Requests | cnt |
| Disk Write Bytes [Basic] | Write bytes | bytes |
| Disk Write Requests [Basic] | Write Request Count | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance Status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Received Packet Drop | cnt |
| Network In Packets [Basic] | Received Packet Count | cnt |
| Network Out Bytes [Basic] | Transmission bytes | bytes |
| Network Out Dropped [Basic] | Transmission Packet Drop | cnt |
| Network Out Packets [Basic] | Transmission Packet Count | cnt |
Table. Multi-node GPU Cluster [Cluster] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Cluster GPU Count | Total number of GPUs in the cluster (sum) | cnt |
| Cluster GPU Count In Use | Number of GPUs in the cluster that are running jobs | cnt |
| Cluster GPU Usage | Average GPU utilization in the cluster | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization in the cluster | % |
Table. Multi-node GPU Cluster [Cluster] additional monitoring metrics (Agent installation required)
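As a rough illustration of how the cluster-level GPU values in the table above relate to per-GPU readings, the following Python sketch sums and averages a few made-up samples. The field names (`util_pct`, `mem_pct`, `in_use`), the `summarize_cluster` helper, and the reading of AVG as an unweighted mean over all GPUs are assumptions for illustration only; they are not taken from Cloud Monitoring.

```python
from statistics import mean


def summarize_cluster(gpus):
    """Aggregate per-GPU samples into cluster-level values (hypothetical helper)."""
    if not gpus:
        raise ValueError("at least one GPU sample is required")
    return {
        # Cluster GPU Count: total number of GPUs in the cluster
        "gpu_count": len(gpus),
        # Cluster GPU Count In Use: GPUs that are currently running a job
        "gpu_count_in_use": sum(1 for g in gpus if g["in_use"]),
        # Cluster GPU Usage: assumed to be a plain mean over all GPUs
        "gpu_usage_avg": mean(g["util_pct"] for g in gpus),
        # Cluster GPU Memory Usage [Avg]: same assumption as above
        "gpu_mem_usage_avg": mean(g["mem_pct"] for g in gpus),
    }


# Made-up samples for two nodes with two GPUs each
samples = [
    {"node": "node-1", "util_pct": 90.0, "mem_pct": 80.0, "in_use": True},
    {"node": "node-1", "util_pct": 70.0, "mem_pct": 60.0, "in_use": True},
    {"node": "node-2", "util_pct": 0.0, "mem_pct": 0.0, "in_use": False},
    {"node": "node-2", "util_pct": 40.0, "mem_pct": 20.0, "in_use": True},
]

summary = summarize_cluster(samples)
print(summary)
```

If the service weights the average differently (for example, per node rather than per GPU), the averaged values would differ from this sketch.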
Multi-node GPU Cluster [Node]
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Memory Total [Basic] | Available memory bytes | bytes |
| Memory Used [Basic] | Currently used memory bytes | bytes |
| Memory Swap In [Basic] | Swapped-in memory bytes | bytes |
| Memory Swap Out [Basic] | Swapped out memory bytes | bytes |
| Memory Free [Basic] | Unused memory bytes | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Write bytes | bytes |
| Disk Write Requests [Basic] | Write Request Count | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance Status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Received Packet Drop | cnt |
| Network In Packets [Basic] | Received Packet Count | cnt |
| Network Out Bytes [Basic] | Transmission bytes | bytes |
| Network Out Dropped [Basic] | Transmission Packet Drop | cnt |
| Network Out Packets [Basic] | Transmission packet count | cnt |
Table. Multi-node GPU Cluster [Node] monitoring metrics (provided by default)
| Performance Item | Detailed Description | Unit |
|---|---|---|
| GPU Count | GPU count | cnt |
| GPU Temperature | GPU temperature | ℃ |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU usage rate | % |
| GPU Power Cap | GPU’s maximum power capacity | W |
| GPU Power Usage | GPU’s current power usage | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs on the node that are running jobs | cnt |
| Execution Status for nvidia-smi | nvidia-smi command execution result | status |
| Core Usage [IO Wait] | The ratio of CPU time spent in waiting state (disk waiting) | % |
| Core Usage [System] | The ratio of CPU time spent in kernel space | % |
| Core Usage [User] | The ratio of CPU time spent in user space | % |
| CPU Cores | The number of CPU cores on the host. Unnormalized ratios have a maximum value of 100% × the number of cores. Normalized ratios already take this value into account and have a maximum value of 100%. | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (unnormalized: 400% if all 4 cores are used at 100%) | % |
| CPU Usage [Idle] | Percentage of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Percentage of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (unnormalized: 400% if all 4 cores are used at 100%) | % |
| CPU Usage [User] | Percentage of CPU time used in user space (unnormalized: 400% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (normalized by the number of cores: 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [Idle] | Percentage of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Percentage of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (normalized by the number of cores: 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space (normalized by the number of cores: 100% if all 4 cores are used at 100%) | % |
| Disk CPU Usage [IO Request] | The ratio of CPU time spent executing input/output requests for the device (device bandwidth utilization). If this value is close to 100%, the device is in a saturated state. | % |
| Disk Queue Size [Avg] | The average queue length of requests executed for the device. | num |
| Disk Read Bytes | The number of bytes read from the device per second. | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of system.diskio.read.bytes for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes read successfully. On Linux, this is the number of sectors read multiplied by an assumed sector size of 512 bytes | bytes |
| Disk Read Requests | The number of read requests for the disk device in 1 second | cnt |
| Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta for each disk | cnt |
| Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta for each disk | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta of individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of successful read completions | cnt |
| Disk Request Size [Avg] | The average size of requests executed for the device (unit: sector) | num |
| Disk Service Time [Avg] | The average service time (in milliseconds) for I/O requests executed on the device. | ms |
| Disk Wait Time [Avg] | The average time spent waiting for requests issued to the device to be served. | ms |
| Disk Wait Time [Read] | Average disk wait time for read requests | ms |
| Disk Wait Time [Write] | Average disk wait time for write requests | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta of individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of system.diskio.write.bytes for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes written successfully. On Linux, this is the number of sectors written multiplied by an assumed sector size of 512 bytes | bytes |
| Disk Write Requests | The number of write requests to the disk device for 1 second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta for each disk | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of writes completed successfully | cnt |
| Disk Writes Bytes | The number of bytes written to the device per second. | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | The total number of file nodes in the file system. | cnt |
| Filesystem Nodes [Free] | The total number of available file nodes in the file system. | cnt |
| Filesystem Size [Available] | Disk space (in bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Available disk space (bytes) | bytes |
| Filesystem Size [Total] | Total Disk Space (bytes) | bytes |
| Filesystem Usage | Used Disk Space Percentage | % |
| Filesystem Usage [Avg] | Average of individual filesystem.used.pct | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum value among individual filesystem usage percentages | % |
| Filesystem Usage [Min] | Minimum of individual filesystem used percentages | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used Disk Space (bytes) | bytes |
| Filesystem Used [Inode] | Inode usage | bytes |
| Memory Free | The total amount of free memory (bytes). It does not include memory used by system cache and buffers (see system.memory.actual.free). | bytes |
| Memory Free [Actual] | Actual available memory (bytes). The calculation method varies by OS. On Linux, it is either MemAvailable from /proc/meminfo or, if /proc/meminfo is not available, the sum of free memory, cache, and buffers. On OSX, it is the sum of free memory and inactive memory. On Windows, it is the same as system.memory.free. | bytes |
| Memory Free [Swap] | Available swap memory. | bytes |
| Memory Total | Total Memory | bytes |
| Memory Total [Swap] | Total swap memory. | bytes |
| Memory Usage | Used memory percentage | % |
| Memory Usage [Actual] | The percentage of memory actually used | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Used swap memory percentage | % |
| Memory Used | Used Memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes). This is the total memory minus the actual available memory. Available memory is calculated differently depending on the OS (see system.memory.actual.free) | bytes |
| Memory Used [Swap] | Used swap memory. | bytes |
| Collisions | Network Collisions | cnt |
| Network In Bytes | Received byte count | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Sum] | Sum of each network’s system.network.in.bytes_delta | bytes |
| Network In Bytes [Delta] | Received byte count delta | bytes |
| Network In Dropped | Number of packets dropped among incoming packets | cnt |
| Network In Errors | Number of errors during reception | cnt |
| Network In Packets | Received packet count | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta of individual networks | cnt |
| Network In Packets [Delta] | Received packet count delta | cnt |
| Network Out Bytes | Transmitted byte count | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Sum] | The sum of system.network.out.bytes_delta of individual networks | bytes |
| Network Out Bytes [Delta] | Transmitted byte count delta | bytes |
| Network Out Dropped | Number of packets dropped among outgoing packets. On Darwin and BSD, the operating system does not report this value, so it is always 0 | cnt |
| Network Out Errors | Number of errors during transmission | cnt |
| Network Out Packets | Number of transmitted packets | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta of individual networks | cnt |
| Network Out Packets [Delta] | Transmitted packet count delta | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Usage rate of the ports available for connections | % |
| SYN Sent Sockets | Number of sockets in SYN_SENT state (when connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update. This value is similar to the %CPU value of the process displayed by the top command on Unix systems | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event, normalized by the number of cores, with a value between 0 and 100% | % |
| Process Memory Usage | Percentage of main memory (RAM) occupied by the process | % |
| Process Memory Used | Resident Set size. The amount of memory a process occupies in RAM. In Windows, it is the current working set size | bytes |
| Process PID | Process PID | PID |
| Process PPID | Parent process’s pid | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state is unknown or cannot be retrieved | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of threads running in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | Context switch count (per second) | cnt |
| Load/Core [1 min] | Load for the last 1 minute divided by the number of cores | cnt |
| Load/Core [15 min] | The value of load divided by the number of cores for the last 15 minutes | cnt |
| Load/Core [5 min] | The value of load divided by the number of cores over the last 5 minutes | cnt |
| Multipaths [Active] | Number of external storage connection paths in the active state | cnt |
| Multipaths [Failed] | Number of external storage connection paths in the failed state | cnt |
| Multipaths [Faulty] | Number of external storage connection paths in the faulty state | cnt |
| NTP Offset | Measured offset of the last sample (time difference between the NTP server and the local environment) | num |
| Run Queue Length | Run queue length | num |
| Uptime | OS operation time (uptime), in milliseconds | ms |
| Context Switches | CPU context switch count (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from the Windows logical disk per second | cnt |
| Disk Read Time [Avg] | Average data read time (sec) | sec |
| Disk Transfer Time [Avg] | Average disk wait time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Number of bytes written to the Windows logical disk per second | cnt |
| Disk Write Time [Avg] | Average data write time (sec) | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Non-paged pool usage in kernel memory | bytes |
| Pool Used [Paged] | Paged pool usage in kernel memory | bytes |
| Process [Running] | The number of processes currently running | cnt |
| Threads [Running] | Number of threads currently running | cnt |
| Threads [Waiting] | The number of threads waiting for processor time | cnt |
Table. Multi-node GPU Cluster [Node] additional monitoring metrics (Agent installation required)
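Several descriptions in the table above involve simple arithmetic: normalizing CPU usage by the number of cores, converting sectors to bytes with a 512-byte sector size, dividing load by the number of cores, and taking the delta between two counter readings. The Python sketch below works through these calculations. The function names and sample numbers are hypothetical and are not part of the Agent or Cloud Monitoring.

```python
SECTOR_SIZE = 512  # bytes per sector, the value the table assumes for Linux


def normalize_cpu_pct(raw_pct, cores):
    """Convert an unnormalized CPU percentage (max 100 * cores) to a 0-100 value."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return raw_pct / cores


def sectors_to_bytes(sectors):
    """Bytes transferred, given a sector count and the assumed sector size."""
    return sectors * SECTOR_SIZE


def load_per_core(load, cores):
    """Load average divided by the number of cores (Load/Core metrics)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load / cores


def counter_delta(previous, current):
    """Difference between two consecutive readings of a cumulative counter."""
    return current - previous


# All 4 cores busy at 100%: 400% unnormalized, 100% normalized
cpu_normalized = normalize_cpu_pct(400.0, 4)

# 2048 sectors read corresponds to 1 MiB
bytes_read = sectors_to_bytes(2048)

# A 1-minute load of 6.0 on a 4-core host
load_core = load_per_core(6.0, 4)

# A byte counter that moved from 10_000 to 12_500 between two samples
read_delta = counter_delta(10_000, 12_500)

print(cpu_normalized, bytes_read, load_core, read_delta)
```

The [Delta Avg], [Delta Max], [Delta Min], and [Delta Sum] variants then apply the corresponding aggregate over the per-disk or per-network delta values.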