The page has been translated by Gen AI.

Monitoring Metrics

Multi-node GPU Cluster monitoring metrics

The following tables list the Multi-node GPU Cluster monitoring metrics that can be viewed through Cloud Monitoring.

Guide
To view monitoring metrics for Multi-node GPU Cluster, you must install the Agent by following the guide. Install the Agent before you start using the service so that it can be operated stably. For the Agent installation method and detailed Cloud Monitoring usage, refer to the Cloud Monitoring guide.

Multi-node GPU Cluster [Cluster]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Available memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Memory swapped in, in bytes | bytes |
| Memory Swap Out [Basic] | Memory swapped out, in bytes | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Number of received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Number of transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Cluster] Monitoring Metrics (Default Provided)

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Cluster GPU Count | Sum of GPU Count within the cluster: the sum of the GPU Count values of all nodes in the same GPU cluster | cnt |
| Cluster GPU Count In Use | Number of GPUs in the cluster on which jobs are running: the sum, across the nodes in the same GPU cluster, of the number of GPUs held by processes, obtained by parsing the 'Processes:' section at the bottom of each node's nvidia-smi output | cnt |
| Cluster GPU Usage | Average GPU utilization within the cluster: the average of the GPU utilization values of the nodes in the same GPU cluster | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization within the cluster: the average of the memory utilization values of the nodes in the same GPU cluster | % |

Table. Multi-node GPU Cluster [Cluster] additional monitoring metrics (Agent installation required)
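
The cluster-level metrics above are described as sums and averages over the nodes in the same GPU cluster. The following is a minimal sketch of that aggregation, not the Agent's actual implementation. The `NodeSample` class, its field names, and the `aggregate_cluster` function are hypothetical names introduced only for illustration, and the sketch assumes that each node has already reported its own values.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeSample:
    """Hypothetical per-node sample; the field names are illustrative only."""
    gpu_count: int            # GPUs installed on the node
    gpus_in_use: int          # GPUs held by at least one process
    gpu_usage_pct: float      # node-level GPU utilization (%)
    gpu_mem_usage_pct: float  # node-level GPU memory utilization (%)


def aggregate_cluster(nodes):
    """Combine per-node samples into the four cluster-level values."""
    if not nodes:
        raise ValueError("cluster has no nodes")
    return {
        # Counts are summed across nodes.
        "cluster_gpu_count": sum(n.gpu_count for n in nodes),
        "cluster_gpu_count_in_use": sum(n.gpus_in_use for n in nodes),
        # Utilization values are averaged across nodes.
        "cluster_gpu_usage": mean(n.gpu_usage_pct for n in nodes),
        "cluster_gpu_memory_usage_avg": mean(n.gpu_mem_usage_pct for n in nodes),
    }


if __name__ == "__main__":
    samples = [
        NodeSample(gpu_count=8, gpus_in_use=4, gpu_usage_pct=50.0, gpu_mem_usage_pct=40.0),
        NodeSample(gpu_count=8, gpus_in_use=8, gpu_usage_pct=100.0, gpu_mem_usage_pct=60.0),
    ]
    print(aggregate_cluster(samples))
```

Note that the averages here are taken per node, as the table describes, so a node with fewer GPUs carries the same weight as a node with more GPUs.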

Multi-node GPU Cluster [Node]

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| Memory Total [Basic] | Available memory in bytes | bytes |
| Memory Used [Basic] | Memory currently in use, in bytes | bytes |
| Memory Swap In [Basic] | Memory swapped in, in bytes | bytes |
| Memory Swap Out [Basic] | Memory swapped out, in bytes | bytes |
| Memory Free [Basic] | Unused memory in bytes | bytes |
| Disk Read Bytes [Basic] | Bytes read | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Bytes written | bytes |
| Disk Write Requests [Basic] | Number of write requests | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance status | state |
| Network In Bytes [Basic] | Bytes received | bytes |
| Network In Dropped [Basic] | Number of received packets dropped | cnt |
| Network In Packets [Basic] | Number of packets received | cnt |
| Network Out Bytes [Basic] | Bytes transmitted | bytes |
| Network Out Dropped [Basic] | Number of transmitted packets dropped | cnt |
| Network Out Packets [Basic] | Number of packets transmitted | cnt |

Table. Multi-node GPU Cluster [Node] Monitoring Metrics (Default Provided)

| Performance Item | Detailed Description | Unit |
| --- | --- | --- |
| GPU Count | Number of GPUs | cnt |
| GPU Temperature | GPU temperature | |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | Overall average GPU usage rate (%) | % |
| GPU Power Cap | Maximum power capacity of the GPU | W |
| GPU Power Usage | Current power usage of the GPU | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs on the node that are running jobs | cnt |
| Execution Status for nvidia-smi | Result of executing the nvidia-smi command | status |
| Core Usage [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| Core Usage [System] | Ratio of CPU time spent in kernel space | % |
| Core Usage [User] | Ratio of CPU time spent in user space | % |
| CPU Cores | Number of CPU cores on the host. Unnormalized percentages have a maximum value of 100% × the number of cores. Normalized percentages already account for this value and have a maximum of 100%. | cnt |
| CPU Usage [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (400% if all 4 cores are used at 100%) | % |
| CPU Usage [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage [IO Wait] | Percentage of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (400% if all 4 cores are used at 100%) | % |
| CPU Usage [User] | Percentage of CPU time used in user space (400% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used, excluding the Idle and IOWait states (normalized by the number of cores; 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [Idle] | Ratio of CPU time spent in the idle state | % |
| CPU Usage/Core [IO Wait] | Ratio of CPU time spent in the waiting state (disk wait) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (normalized by the number of cores; 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in user space (normalized by the number of cores; 100% if all 4 cores are used at 100%) | % |
| Disk CPU Usage [IO Request] | Ratio of CPU time spent executing I/O requests for the device (device bandwidth utilization). A value close to 100% means the device is saturated. | % |
| Disk Queue Size [Avg] | Average queue length of the requests issued to the device | num |
| Disk Read Bytes | Number of bytes read from the device per second | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta Sum] | Sum of system.diskio.read.bytes_delta across individual disks | bytes |
| Disk Read Bytes [Delta] | Delta of system.diskio.read.bytes for each disk | bytes |
| Disk Read Bytes [Success] | Total number of bytes read successfully. On Linux, this is the number of sectors read multiplied by an assumed sector size of 512. | bytes |
| Disk Read Requests | Number of read requests to the disk device per second | cnt |
| Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta across individual disks | cnt |
| Disk Read Requests [Success Delta] | Delta of system.diskio.read.count for each disk | cnt |
| Disk Read Requests [Success] | Total number of reads completed successfully | cnt |
| Disk Request Size [Avg] | Average size of the requests issued to the device (unit: sectors) | num |
| Disk Service Time [Avg] | Average service time (in milliseconds) for I/O requests issued to the device | ms |
| Disk Wait Time [Avg] | Average time for requests issued to the device to be served | ms |
| Disk Wait Time [Read] | Average disk wait time (read) | ms |
| Disk Wait Time [Write] | Average disk wait time (write) | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta Sum] | Sum of system.diskio.write.bytes_delta across individual disks | bytes |
| Disk Write Bytes [Delta] | Delta of system.diskio.write.bytes for each disk | bytes |
| Disk Write Bytes [Success] | Total number of bytes written successfully. On Linux, this is the number of sectors written multiplied by an assumed sector size of 512. | bytes |
| Disk Write Requests | Number of write requests to the disk device per second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta across individual disks | cnt |
| Disk Write Requests [Success Delta] | Delta of system.diskio.write.count for each disk | cnt |
| Disk Write Requests [Success] | Total number of writes completed successfully | cnt |
| Disk Writes Bytes | Number of bytes written to the device per second | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | Total number of file nodes in the file system | cnt |
| Filesystem Nodes [Free] | Total number of free file nodes in the file system | cnt |
| Filesystem Size [Available] | Disk space (in bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Free disk space (bytes) | bytes |
| Filesystem Size [Total] | Total disk space (bytes) | bytes |
| Filesystem Usage | Percentage of disk space used | % |
| Filesystem Usage [Avg] | Average of filesystem.used.pct across individual filesystems | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Min] | Minimum of the usage percentages of individual filesystems | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used disk space (bytes) | bytes |
| Filesystem Used [Inode] | Inode usage | bytes |
| Memory Free | Total amount of free memory (bytes). Does not include memory used by system caches and buffers (see system.memory.actual.free). | bytes |
| Memory Free [Actual] | Actual free memory (bytes). The calculation varies by OS. On Linux, it is MemAvailable from /proc/meminfo or, if that is not available, a value calculated from free memory, cache, and buffers. On OSX, it is the sum of free memory and inactive memory. On Windows, it is the same as system.memory.free. | bytes |
| Memory Free [Swap] | Free swap memory | bytes |
| Memory Total | Total memory | bytes |
| Memory Total [Swap] | Total swap memory | bytes |
| Memory Usage | Percentage of memory used: ((Memory Total - Memory Free) / Memory Total) * 100, where Memory Free is the free memory currently available | % |
| Memory Usage [Actual] | Percentage of memory actually used: ((Memory Total - Memory Available) / Memory Total) * 100 or ((Memory Total - (Memory Free + Buffers + Cached)) / Memory Total) * 100, where Memory Free is the free memory currently available, Buffers is the memory used by buffers, and Cached is the memory used by the page cache | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Percentage of swap memory used | % |
| Memory Used | Used memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes): the difference between total memory and available memory. Available memory is calculated differently depending on the OS (refer to system.actual.free). | bytes |
| Memory Used [Swap] | Used swap memory | bytes |
| Collisions | Network collisions | cnt |
| Network In Bytes | Number of bytes received | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta Sum] | Sum of system.network.in.bytes_delta across individual networks | bytes |
| Network In Bytes [Delta] | Delta of the number of bytes received | bytes |
| Network In Dropped | Number of incoming packets dropped | cnt |
| Network In Errors | Number of errors during reception | cnt |
| Network In Packets | Number of packets received | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta across individual networks | cnt |
| Network In Packets [Delta] | Delta of the number of packets received | cnt |
| Network Out Bytes | Number of bytes transmitted | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta Sum] | Sum of system.network.out.bytes_delta across individual networks | bytes |
| Network Out Bytes [Delta] | Delta of the number of bytes transmitted | bytes |
| Network Out Dropped | Number of outgoing packets dropped. Darwin and BSD operating systems do not report this value, so it is always 0 on them. | cnt |
| Network Out Errors | Number of errors during transmission | cnt |
| Network Out Packets | Number of packets transmitted | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta across individual networks | cnt |
| Network Out Packets [Delta] | Delta of the number of packets transmitted | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Usage rate of the ports available for connections | % |
| SYN Sent Sockets | Number of sockets in the SYN_SENT state (when connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel.threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process since the last update. This value is similar to the %CPU value of the process displayed by the top command on Unix systems. | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event, normalized by the number of cores, with a value between 0 and 100% | % |
| Process Memory Usage | Percentage of main memory (RAM) occupied by the process | % |
| Process Memory Used | Resident set size: the amount of memory the process occupies in RAM. On Windows, it is the current working set size. | bytes |
| Process PID | Process PID | PID |
| Process PPID | PID of the parent process | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state cannot be retrieved or is unknown | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of threads running in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | Number of context switches (per second) | cnt |
| Load/Core [1 min] | Load over the last 1 minute divided by the number of cores | cnt |
| Load/Core [15 min] | Load over the last 15 minutes divided by the number of cores | cnt |
| Load/Core [5 min] | Load over the last 5 minutes divided by the number of cores | cnt |
| Multipaths [Active] | Number of external storage connection paths with status = active | cnt |
| Multipaths [Failed] | Number of external storage connection paths with status = failed | cnt |
| Multipaths [Faulty] | Number of external storage connection paths with status = faulty | cnt |
| NTP Offset | Measured offset of the last sample (time difference between the NTP server and the local environment) | num |
| Run Queue Length | Length of the run queue (execution wait queue) | num |
| Uptime | OS uptime (milliseconds) | ms |
| Context Switchies | Number of CPU context switches (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from the Windows logical disk per second | cnt |
| Disk Read Time [Avg] | Average data read time (sec) | sec |
| Disk Transfer Time [Avg] | Average disk wait time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Number of bytes written to the Windows logical disk per second | cnt |
| Disk Write Time [Avg] | Average data write time (sec) | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Non-paged pool usage of kernel memory | bytes |
| Pool Used [Paged] | Paged pool usage of kernel memory | bytes |
| Process [Running] | Number of processes currently running | cnt |
| Threads [Running] | Number of threads currently running | cnt |
| Threads [Waiting] | Number of threads waiting for processor time | cnt |

Table. Multi-node GPU Cluster [Node] additional monitoring metrics (Agent installation required)
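
The Memory Usage and Memory Usage [Actual] rows give explicit formulas, and the CPU Usage and CPU Usage/Core rows differ only in whether the value is normalized by the number of cores. The following is a minimal sketch of those calculations, intended only to make the arithmetic concrete. The function and parameter names are hypothetical and do not correspond to any Cloud Monitoring or Agent API.

```python
def memory_usage_pct(total, free):
    """Memory Usage: ((Memory Total - Memory Free) / Memory Total) * 100."""
    return (total - free) / total * 100


def memory_usage_actual_pct(total, free, buffers=0, cached=0, available=None):
    """Memory Usage [Actual].

    Uses Memory Available when it is given; otherwise falls back to
    Memory Free + Buffers + Cached, as in the second formula in the table.
    """
    if available is None:
        available = free + buffers + cached
    return (total - available) / total * 100


def per_core_cpu_pct(unnormalized_pct, cores):
    """Convert a CPU Usage value (maximum 100% x cores) to its
    CPU Usage/Core form (maximum 100%)."""
    return unnormalized_pct / cores


if __name__ == "__main__":
    gib = 1024 ** 3
    print(memory_usage_pct(total=16 * gib, free=4 * gib))
    print(memory_usage_actual_pct(total=16 * gib, free=4 * gib,
                                  buffers=2 * gib, cached=2 * gib))
    print(per_core_cpu_pct(400, cores=4))
```

For example, `per_core_cpu_pct(400, cores=4)` returns `100.0`, which matches the table's case of all 4 cores being used at 100%. Memory Usage [Actual] is lower than Memory Usage whenever buffers and page cache are non-zero, because that memory counts as available.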