Monitoring Metrics
GPU Server Monitoring Metrics
The following tables list the GPU Server monitoring metrics that can be checked through Cloud Monitoring.
Basic monitoring metrics are provided even without an Agent installed; see Table. GPU Server Monitoring Metrics (Basic) below. Metrics that become available after installing an Agent are listed in Table. GPU Server Additional Monitoring Metrics (Agent Installation Required) below.
For details on using Cloud Monitoring, refer to the Cloud Monitoring guide.
| Performance Item Name | Description | Unit |
|---|---|---|
| Memory Total [Basic] | Total available memory in bytes | bytes |
| Memory Used [Basic] | Currently used memory in bytes | bytes |
| Memory Swap In [Basic] | Bytes of memory swapped in | bytes |
| Memory Swap Out [Basic] | Bytes of memory swapped out | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Written bytes | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | Average system CPU usage over 1 minute | % |
| Instance State [Basic] | Instance state | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Dropped received packets | cnt |
| Network In Packets [Basic] | Received packets | cnt |
| Network Out Bytes [Basic] | Sent bytes | bytes |
| Network Out Dropped [Basic] | Dropped sent packets | cnt |
| Network Out Packets [Basic] | Sent packets | cnt |
Table. GPU Server Monitoring Metrics (Basic)
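The basic metrics are raw values, so ratios such as a memory usage rate or an average I/O size have to be derived from them. The following is a minimal sketch of that arithmetic, assuming the values have already been exported into a plain dictionary keyed by metric name. The dictionary shape and the function names are hypothetical and are not part of any Cloud Monitoring API.

```python
def memory_usage_percent(sample: dict) -> float:
    """Return used memory as a percentage of total memory."""
    total = sample["Memory Total"]
    if total <= 0:
        raise ValueError("Memory Total must be positive")
    return 100.0 * sample["Memory Used"] / total


def average_request_size(total_bytes: int, requests: int) -> float:
    """Return the average bytes per I/O request (0.0 when there were no requests)."""
    return total_bytes / requests if requests else 0.0


# Hypothetical sample: illustrative numbers, not real measurements.
sample = {
    "Memory Total": 64 * 1024**3,
    "Memory Used": 16 * 1024**3,
    "Disk Read Bytes": 40960,
    "Disk Read Requests": 10,
}

usage = memory_usage_percent(sample)
avg_read = average_request_size(sample["Disk Read Bytes"], sample["Disk Read Requests"])
```

With the sample above, `usage` is 25.0 and `avg_read` is 4096.0. Guarding against a zero denominator matters in practice, because a stopped instance may report no requests at all.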
| Performance Item Name | Description | Unit |
|---|---|---|
| GPU Count | Number of GPUs | cnt |
| GPU Memory Usage | GPU memory usage rate | % |
| GPU Memory Used | Used GPU memory | MB |
| GPU Temperature | GPU temperature | ℃ |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Average GPU usage rate | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current power usage of the GPU | W |
| GPU Memory Usage [Avg] | Average GPU memory usage rate | % |
| GPU Count in use | Number of GPUs in use by jobs on the node | cnt |
| Execution Status for nvidia-smi | Execution result of the nvidia-smi command | status |
| Core Usage [IO Wait] | CPU time spent in IO wait state | % |
| Core Usage [System] | CPU time spent in system space | % |
| Core Usage [User] | CPU time spent in user space | % |
| CPU Cores | Number of CPU cores on the host | cnt |
| CPU Usage [Active] | CPU time used, excluding idle and IO wait states | % |
| CPU Usage [Idle] | CPU time spent in idle state | % |
| CPU Usage [IO Wait] | CPU time spent in IO wait state | % |
| CPU Usage [System] | CPU time used by the kernel | % |
| CPU Usage [User] | CPU time used by user space | % |
| CPU Usage/Core [Active] | CPU time used per core, excluding idle and IO wait states | % |
| CPU Usage/Core [Idle] | CPU time spent in idle state per core | % |
| CPU Usage/Core [IO Wait] | CPU time spent in IO wait state per core | % |
| CPU Usage/Core [System] | CPU time used by the kernel per core | % |
| CPU Usage/Core [User] | CPU time used by user space per core | % |
| Disk CPU Usage [IO Request] | CPU time spent on IO requests | % |
| Disk Queue Size [Avg] | Average queue length of requests | num |
| Disk Read Bytes | Bytes read from the device per second | bytes |
| Disk Read Bytes [Delta Avg] | Average delta of bytes read from the device | bytes |
| Disk Read Bytes [Delta Max] | Maximum delta of bytes read from the device | bytes |
| Disk Read Bytes [Delta Min] | Minimum delta of bytes read from the device | bytes |
| Disk Read Bytes [Delta Sum] | Sum of delta of bytes read from the device | bytes |
| Disk Read Bytes [Delta] | Delta of bytes read from the device | bytes |
| Disk Read Bytes [Success] | Total bytes successfully read | bytes |
| Disk Read Requests | Number of read requests to the device per second | cnt |
| Disk Read Requests [Delta Avg] | Average delta of read requests to the device | cnt |
| Disk Read Requests [Delta Max] | Maximum delta of read requests to the device | cnt |
| Disk Read Requests [Delta Min] | Minimum delta of read requests to the device | cnt |
| Disk Read Requests [Delta Sum] | Sum of delta of read requests to the device | cnt |
| Disk Read Requests [Success Delta] | Delta of successful read requests to the device | cnt |
| Disk Read Requests [Success] | Total successful read requests | cnt |
| Disk Request Size [Avg] | Average size of requests to the device | num |
| Disk Service Time [Avg] | Average service time of requests to the device | ms |
| Disk Wait Time [Avg] | Average wait time of requests to the device | ms |
| Disk Wait Time [Read] | Average read wait time of the device | ms |
| Disk Wait Time [Write] | Average write wait time of the device | ms |
| Disk Write Bytes [Delta Avg] | Average delta of bytes written to the device | bytes |
| Disk Write Bytes [Delta Max] | Maximum delta of bytes written to the device | bytes |
| Disk Write Bytes [Delta Min] | Minimum delta of bytes written to the device | bytes |
| Disk Write Bytes [Delta Sum] | Sum of delta of bytes written to the device | bytes |
| Disk Write Bytes [Delta] | Delta of bytes written to the device | bytes |
| Disk Write Bytes [Success] | Total bytes successfully written | bytes |
| Disk Write Requests | Number of write requests to the device per second | cnt |
| Disk Write Requests [Delta Avg] | Average delta of write requests to the device | cnt |
| Disk Write Requests [Delta Max] | Maximum delta of write requests to the device | cnt |
| Disk Write Requests [Delta Min] | Minimum delta of write requests to the device | cnt |
| Disk Write Requests [Delta Sum] | Sum of delta of write requests to the device | cnt |
| Disk Write Requests [Success Delta] | Delta of successful write requests to the device | cnt |
| Disk Write Requests [Success] | Total successful write requests | cnt |
| Disk Writes Bytes | Bytes written to the device per second | bytes |
| Filesystem Hang Check | Filesystem hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | Total number of filesystem nodes | cnt |
| Filesystem Nodes [Free] | Total number of available filesystem nodes | cnt |
| Filesystem Size [Available] | Available disk space in bytes | bytes |
| Filesystem Size [Free] | Free disk space in bytes | bytes |
| Filesystem Size [Total] | Total disk space in bytes | bytes |
| Filesystem Usage | Disk space usage rate | % |
| Filesystem Usage [Avg] | Average disk space usage rate | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum disk space usage rate | % |
| Filesystem Usage [Min] | Minimum disk space usage rate | % |
| Filesystem Usage [Total] | Total disk space usage rate | % |
| Filesystem Used | Used disk space in bytes | bytes |
| Filesystem Used [Inode] | Used inode space in bytes | bytes |
| Memory Free | Total available memory in bytes | bytes |
| Memory Free [Actual] | Actual available memory in bytes | bytes |
| Memory Free [Swap] | Available swap memory in bytes | bytes |
| Memory Total | Total memory in bytes | bytes |
| Memory Total [Swap] | Total swap memory in bytes | bytes |
| Memory Usage | Memory usage rate | % |
| Memory Usage [Actual] | Actual memory usage rate | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Swap memory usage rate | % |
| Memory Used | Used memory in bytes | bytes |
| Memory Used [Actual] | Actual used memory in bytes | bytes |
| Memory Used [Swap] | Used swap memory in bytes | bytes |
| Collisions | Network collisions | cnt |
| Network In Bytes | Received bytes | bytes |
| Network In Bytes [Delta Avg] | Average delta of received bytes | bytes |
| Network In Bytes [Delta Max] | Maximum delta of received bytes | bytes |
| Network In Bytes [Delta Min] | Minimum delta of received bytes | bytes |
| Network In Bytes [Delta Sum] | Sum of delta of received bytes | bytes |
| Network In Bytes [Delta] | Delta of received bytes | bytes |
| Network In Dropped | Dropped received packets | cnt |
| Network In Errors | Received errors | cnt |
| Network In Packets | Received packets | cnt |
| Network In Packets [Delta Avg] | Average delta of received packets | cnt |
| Network In Packets [Delta Max] | Maximum delta of received packets | cnt |
| Network In Packets [Delta Min] | Minimum delta of received packets | cnt |
| Network In Packets [Delta Sum] | Sum of delta of received packets | cnt |
| Network In Packets [Delta] | Delta of received packets | cnt |
| Network Out Bytes | Sent bytes | bytes |
| Network Out Bytes [Delta Avg] | Average delta of sent bytes | bytes |
| Network Out Bytes [Delta Max] | Maximum delta of sent bytes | bytes |
| Network Out Bytes [Delta Min] | Minimum delta of sent bytes | bytes |
| Network Out Bytes [Delta Sum] | Sum of delta of sent bytes | bytes |
| Network Out Bytes [Delta] | Delta of sent bytes | bytes |
| Network Out Dropped | Dropped sent packets | cnt |
| Network Out Errors | Sent errors | cnt |
| Network Out Packets | Sent packets | cnt |
| Network Out Packets [Delta Avg] | Average delta of sent packets | cnt |
| Network Out Packets [Delta Max] | Maximum delta of sent packets | cnt |
| Network Out Packets [Delta Min] | Minimum delta of sent packets | cnt |
| Network Out Packets [Delta Sum] | Sum of delta of sent packets | cnt |
| Network Out Packets [Delta] | Delta of sent packets | cnt |
| Open Connections [TCP] | Open TCP connections | cnt |
| Open Connections [UDP] | Open UDP connections | cnt |
| Port Usage | Port usage rate | % |
| SYN Sent Sockets | Number of sockets in SYN_SENT state | cnt |
| Kernel PID Max | Maximum PID value | cnt |
| Kernel Thread Max | Maximum thread value | cnt |
| Process CPU Usage | CPU time used by the process | % |
| Process CPU Usage/Core | CPU time used by the process per core | % |
| Process Memory Usage | Process memory usage rate (resident set size) | % |
| Process Memory Used | Used memory by the process | bytes |
| Process PID | Process ID | PID |
| Process PPID | Parent process ID | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of unknown processes | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of running threads | cnt |
| Context Switches | Context switches per second | cnt |
| Load/Core [1 min] | Load per core over 1 minute | cnt |
| Load/Core [15 min] | Load per core over 15 minutes | cnt |
| Load/Core [5 min] | Load per core over 5 minutes | cnt |
| Multipaths [Active] | Number of active multipath connections | cnt |
| Multipaths [Failed] | Number of failed multipath connections | cnt |
| Multipaths [Faulty] | Number of faulty multipath connections | cnt |
| NTP Offset | Measured offset from the NTP server | num |
| Run Queue Length | Run queue length | num |
| Uptime | System uptime in milliseconds | ms |
| Context Switchies | Context switches per second | cnt |
| Disk Read Bytes [Sec] | Bytes read from the device per second | cnt |
| Disk Read Time [Avg] | Average read time from the device | sec |
| Disk Transfer Time [Avg] | Average disk transfer time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Bytes written to the device per second | cnt |
| Disk Write Time [Avg] | Average write time to the device | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Non-paged pool usage | bytes |
| Pool Used [Paged] | Paged pool usage | bytes |
| Process [Running] | Number of running processes | cnt |
| Threads [Running] | Number of running threads | cnt |
| Threads [Waiting] | Number of waiting threads | cnt |
Table. GPU Server Additional Monitoring Metrics (Agent Installation Required)
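The table does not describe how the Agent collects the GPU metrics. As a rough illustration of where values such as GPU Count, GPU Usage [Avg], and GPU Memory Usage [Avg] could come from, the sketch below parses CSV output in the format produced by an `nvidia-smi --query-gpu` call. The query fields and flags are real `nvidia-smi` options, but the assumption that the Agent uses this query, the averaging across GPUs, and all function names are hypothetical.

```python
import csv
import io
from statistics import mean

# Real nvidia-smi query fields; that the Agent uses this exact query is an assumption.
QUERY_FIELDS = [
    "utilization.gpu", "memory.used", "memory.total",
    "temperature.gpu", "power.draw", "power.limit",
]
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse one CSV row per GPU into a dict of floats keyed by field name.

    Note: some GPUs report "[N/A]" for a field, which float() would reject;
    this sketch does not handle that case.
    """
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        values = [float(cell.strip()) for cell in row]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def summarize(gpus: list[dict]) -> dict:
    """Derive table-style metrics from per-GPU readings (averaged across GPUs)."""
    mem_usage = [100.0 * g["memory.used"] / g["memory.total"] for g in gpus]
    return {
        "GPU Count": len(gpus),
        "GPU Usage [Avg]": mean(g["utilization.gpu"] for g in gpus),
        "GPU Memory Usage [Avg]": mean(mem_usage),
    }


# Illustrative output for two GPUs (made-up numbers, not a real capture).
SAMPLE = "40, 20000, 80000, 55, 250.0, 400.0\n60, 40000, 80000, 61, 300.0, 400.0\n"
summary = summarize(parse_gpu_csv(SAMPLE))
```

On a host with the NVIDIA driver installed, the text passed to `parse_gpu_csv` could be obtained with `subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True, check=True).stdout`; a non-zero exit code from that call would correspond loosely to the Execution Status for nvidia-smi metric.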