Multi-node GPU Cluster
1 - Overview
Service Overview
Multi-node GPU Cluster is a service that provides physical GPU servers, without virtualization, for large-scale, high-performance AI computation. Two or more GPU-equipped bare metal servers can be clustered to combine multiple GPUs, and the service integrates conveniently with Samsung Cloud Platform’s high-performance storage and networking services.
Provided Features
Multi-node GPU Cluster provides the following features.
- Auto Provisioning and Management: Through the web-based Console, you can easily manage the standard GPU Bare Metal server model with 8 GPU cards, from provisioning through resource and cost management.
- Network Connection: Two or more Bare Metal Servers can be clustered through high-speed interconnects to use multiple GPUs together. By configuring a GPU Direct RDMA (Remote Direct Memory Access) environment, data can move directly between GPU memories, enabling high-speed AI/machine learning computation.
- Storage Connection: In addition to OS disks, various additional storage can be attached. Directly connected high-speed network and high-performance SSD NAS File Storage, Block Storage, and Object Storage can also be used together.
- Network Setting Management: The server’s subnet/IP can easily be changed from its initially configured value. A NAT IP can be attached or detached as needed.
- Monitoring: You can check monitoring information for computing resources such as CPU, GPU, memory, and disk through Cloud Monitoring. To use the Cloud Monitoring service with Multi-node GPU Cluster, you must install the Agent; please install it for stable service use. For more information, refer to Multi-node GPU Cluster Monitoring Metrics.
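As an illustration of how the GPU Direct RDMA environment is typically consumed by applications, NCCL-based distributed training is usually tuned through NCCL environment variables. This is a hedged sketch only: `NCCL_IB_DISABLE`, `NCCL_NET_GDR_LEVEL`, and `NCCL_DEBUG` are standard NCCL environment variables, but the values shown are illustrative and the right settings depend on the cluster’s fabric configuration.

```python
import os

# Hedged sketch: NCCL environment settings commonly used on a GPUDirect
# RDMA fabric, set before launching a distributed training process.
# The variable names are standard NCCL env vars; the values are illustrative.
os.environ["NCCL_IB_DISABLE"] = "0"        # enable the IB/RoCE transport
os.environ["NCCL_NET_GDR_LEVEL"] = "SYS"   # permit GPUDirect RDMA system-wide
os.environ["NCCL_DEBUG"] = "INFO"          # log transport selection at startup

print(os.environ["NCCL_NET_GDR_LEVEL"])  # SYS
```

With `NCCL_DEBUG=INFO`, NCCL logs which transport it selected at startup, which is a convenient way to confirm that GPU Direct RDMA is actually in use.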
Component
Multi-node GPU Cluster provides GPUs as Bare Metal Servers with standard images and server types, and NVSwitch and NVLink are included.
GPU (H100)
A GPU (Graphics Processing Unit) specializes in parallel computation that can process large amounts of data quickly, enabling large-scale parallel processing in fields such as artificial intelligence (AI) and data analysis.
The following are the specifications of the GPU Type provided by the Multi-node GPU Cluster service.
| Classification | H100 Type |
|---|---|
| Product Provisioning Method | Bare Metal |
| GPU Architecture | NVIDIA Hopper |
| GPU Memory | 80GB |
| GPU Transistors | 80 billion (TSMC 4N process) |
| GPU Tensor Performance(based on FP16) | 989.4 TFLOPs, 1,978.9 TFLOPs* |
| GPU Memory Bandwidth | 3,352 GB/sec HBM3 |
| GPU CUDA Cores | 16,896 Cores |
| GPU Tensor Cores | 528(4th Generation) |
| NVLink performance | NVLink 4 |
| Total NVLink bandwidth | 900 GB/s |
| NVLink Signaling Rate | 25 Gbps (x18) |
| NVSwitch performance | NVSwitch 3 |
| NVSwitch GPU bandwidth | 900 GB/s |
| Total NVSwitch Aggregate Bandwidth | 7.2TB/s |
* With sparsity
OS and GPU Driver Version
The operating systems (OS) supported by Multi-node GPU Cluster are as follows.
| OS | OS version | GPU driver version |
|---|---|---|
| Ubuntu | 22.04 | 535.86.10, 535.183.06 |
Server Type
The server types provided by Multi-node GPU Cluster are as follows. For a detailed description of the server types provided by Multi-node GPU Cluster, please refer to Multi-node GPU Cluster server type.
g2c96h8_metal
| Classification | Example | Detailed Description |
|---|---|---|
| Server Generation | g2 | Provided server generation |
| CPU | c96 | Number of cores |
| GPU | h8 | GPU type and quantity |
Preceding Service
The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare them in advance.
| Service Category | Service | Detailed Description |
|---|---|---|
| Networking | VPC | A service that provides an independent virtual network in a cloud environment |
1.1 - Server Type
Multi-node GPU Cluster Server Type
Multi-node GPU Cluster offerings are divided by the provided GPU type, and the GPU used in a Multi-node GPU Cluster is determined by the server type selected when creating a GPU Node. Select the server type according to the requirements of the application you want to run on the Multi-node GPU Cluster.
The server types supported by Multi-node GPU Cluster are in the following format:
g2c96h8_metal
| Classification | Example | Detailed Description |
|---|---|---|
| Server Generation | g2 | Provided server generation |
| CPU | c96 | Number of cores |
| GPU | h8 | GPU type and quantity |
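The naming convention above can be made concrete with a small parser. This is a hypothetical helper for illustration only (the regular expression and returned field names are assumptions based on the format described, not part of any official SDK):

```python
import re

# Hypothetical helper illustrating the server type naming convention:
# g<generation>c<cores><gpu-letter><gpu-count>_metal, e.g. "g2c96h8_metal".
_SERVER_TYPE_RE = re.compile(r"^g(\d+)c(\d+)([a-z])(\d+)_metal$")

def parse_server_type(server_type: str) -> dict:
    m = _SERVER_TYPE_RE.match(server_type)
    if m is None:
        raise ValueError(f"unrecognized server type: {server_type!r}")
    return {
        "generation": int(m.group(1)),  # g2  -> server generation 2
        "cpu_cores": int(m.group(2)),   # c96 -> 96 cores
        "gpu_type": m.group(3),         # h   -> H100
        "gpu_count": int(m.group(4)),   # 8   -> 8 GPUs
    }

print(parse_server_type("g2c96h8_metal"))
# {'generation': 2, 'cpu_cores': 96, 'gpu_type': 'h', 'gpu_count': 8}
```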
g2 Server Type
The g2 server type is a GPU Bare Metal Server using NVIDIA H100 Tensor Core GPU, suitable for large-scale high-performance AI computing.
- Provides up to 8 NVIDIA H100 Tensor Core GPUs
- Each GPU has 16,896 CUDA cores and 528 Tensor cores
- Supports up to 96 vCPUs and 1,920 GB of memory
- Supports up to 100 Gbps networking speed
- 900 GB/s GPU peer-to-peer communication via NVIDIA NVSwitch
| Server Type | GPU | GPU Memory | CPU(Core) | Memory | Disk | GPU P2P |
|---|---|---|---|---|---|---|
| g2c96h8_metal | H100 | 640 GB | 96 vCore | 2 TB | SSD(OS) 960 GB * 2, NVMeSSD 3.84 TB * 4 | 900GB/s NVSwitch |
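The aggregate figures in the table follow directly from the per-GPU H100 specifications (8 GPUs per g2c96h8_metal server), as this small check shows:

```python
# Aggregate g2c96h8_metal figures derived from the per-GPU H100 specs above.
GPUS_PER_SERVER = 8
GPU_MEMORY_GB = 80      # HBM3 per H100
CUDA_CORES = 16_896     # per H100
TENSOR_CORES = 528      # per H100

total_gpu_memory_gb = GPUS_PER_SERVER * GPU_MEMORY_GB
print(total_gpu_memory_gb)  # 640, matching the GPU Memory column
```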
1.2 - Monitoring Metrics
Multi-node GPU Cluster monitoring metrics
The following table shows the monitoring metrics of Multi-node GPU Cluster that can be checked through Cloud Monitoring.
Multi-node GPU Cluster [Cluster]
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Memory Total [Basic] | Total memory bytes | bytes |
| Memory Used [Basic] | Currently used memory bytes | bytes |
| Memory Swap In [Basic] | Swapped-in memory bytes | bytes |
| Memory Swap Out [Basic] | Swapped out memory bytes | bytes |
| Memory Free [Basic] | Unused memory bytes | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of Read Requests | cnt |
| Disk Write Bytes [Basic] | Write bytes | bytes |
| Disk Write Requests [Basic] | Write Request Count | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance Status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Receive Packet Drop | cnt |
| Network In Packets [Basic] | Received Packet Count | cnt |
| Network Out Bytes [Basic] | Transmission bytes | bytes |
| Network Out Dropped [Basic] | Transmission Packet Drop | cnt |
| Network Out Packets [Basic] | Transmission Packet Count | cnt |
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Cluster GPU Count | Total number of GPUs in the cluster | cnt |
| Cluster GPU Count In Use | Number of GPUs in the cluster currently running jobs | cnt |
| Cluster GPU Usage | Average GPU utilization across the cluster | % |
| Cluster GPU Memory Usage [Avg] | Average GPU memory utilization across the cluster | % |
Multi-node GPU Cluster [Node]
| Performance Item | Detailed Description | Unit |
|---|---|---|
| Memory Total [Basic] | Total memory bytes | bytes |
| Memory Used [Basic] | Currently used memory bytes | bytes |
| Memory Swap In [Basic] | Swapped-in memory bytes | bytes |
| Memory Swap Out [Basic] | Swapped-out memory bytes | bytes |
| Memory Free [Basic] | Unused memory bytes | bytes |
| Disk Read Bytes [Basic] | Read bytes | bytes |
| Disk Read Requests [Basic] | Number of read requests | cnt |
| Disk Write Bytes [Basic] | Write bytes | bytes |
| Disk Write Requests [Basic] | Write Request Count | cnt |
| CPU Usage [Basic] | 1-minute average system CPU usage rate | % |
| Instance State [Basic] | Instance Status | state |
| Network In Bytes [Basic] | Received bytes | bytes |
| Network In Dropped [Basic] | Received Packet Drop | cnt |
| Network In Packets [Basic] | Received Packet Count | cnt |
| Network Out Bytes [Basic] | Transmission bytes | bytes |
| Network Out Dropped [Basic] | Transmission Packet Drop | cnt |
| Network Out Packets [Basic] | Transmission packet count | cnt |
| Performance Item | Detailed Description | Unit |
|---|---|---|
| GPU Count | GPU count | cnt |
| GPU Temperature | GPU temperature | ℃ |
| GPU Usage | GPU utilization | % |
| GPU Usage [Avg] | GPU Overall Average Usage Rate(%) | % |
| GPU Power Cap | GPU’s maximum power capacity | W |
| GPU Power Usage | GPU’s current power usage | W |
| GPU Memory Usage [Avg] | Average GPU memory utilization | % |
| GPU Count in use | Number of GPUs on the node currently running jobs | cnt |
| Execution Status for nvidia-smi | Result of executing the nvidia-smi command | status |
| Core Usage [IO Wait] | The ratio of CPU time spent in waiting state (disk waiting) | % |
| Core Usage [System] | The ratio of CPU time spent in kernel space | % |
| Core Usage [User] | The ratio of CPU time spent in user space | % |
| CPU Cores | The number of CPU cores on the host. The maximum value of the unnormalized CPU usage ratios is 100% × the number of cores. | cnt |
| CPU Usage [Active] | Percentage of CPU time used excluding Idle and IOWait states (if all 4 cores are used at 100%: 400%) | % |
| CPU Usage [Idle] | The ratio of CPU time spent in idle state. | % |
| CPU Usage [IO Wait] | The percentage of CPU time spent in waiting state (disk waiting) | % |
| CPU Usage [System] | Percentage of CPU time used by the kernel (in case of using all 4 cores 100%: 400%) | % |
| CPU Usage [User] | Percentage of CPU time used in the user area. (In case of using all 4 cores 100%, 400%) | % |
| CPU Usage/Core [Active] | Percentage of CPU time used excluding Idle and IOWait states (normalized value by number of cores, 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [Idle] | The ratio of CPU time spent in idle state. | % |
| CPU Usage/Core [IO Wait] | The ratio of CPU time spent in waiting state (disk waiting) | % |
| CPU Usage/Core [System] | Percentage of CPU time used by the kernel (normalized value by number of cores, 100% if all 4 cores are used at 100%) | % |
| CPU Usage/Core [User] | Percentage of CPU time used in the user area. (normalized value by number of cores, 100% if all 4 cores are used at 100%) | % |
| Disk CPU Usage [IO Request] | The ratio of CPU time spent executing input/output requests for the device (device bandwidth utilization). If this value is close to 100%, the device is in a saturated state. | % |
| Disk Queue Size [Avg] | The average queue length of requests executed for the device. | num |
| Disk Read Bytes | The number of bytes read from the device per second. | bytes |
| Disk Read Bytes [Delta Avg] | Average of system.diskio.read.bytes_delta for each disk | bytes |
| Disk Read Bytes [Delta Max] | Maximum of system.diskio.read.bytes_delta for each disk | bytes |
| Disk Read Bytes [Delta Min] | Minimum of system.diskio.read.bytes_delta for each disk | bytes |
| Disk Read Bytes [Delta Sum] | The sum of system.diskio.read.bytes_delta of individual disks | bytes |
| Disk Read Bytes [Delta] | Delta value of system.diskio.read.bytes for each disk | bytes |
| Disk Read Bytes [Success] | The total number of bytes read successfully. Linux assumes a 512-byte sector size and reports the number of sectors read multiplied by 512 | bytes |
| Disk Read Requests | The number of read requests for the disk device in 1 second | cnt |
| Disk Read Requests [Delta Avg] | Average of system.diskio.read.count_delta for each disk | cnt |
| Disk Read Requests [Delta Max] | Maximum of system.diskio.read.count_delta for individual disks | cnt |
| Disk Read Requests [Delta Min] | Minimum of system.diskio.read.count_delta for each disk | cnt |
| Disk Read Requests [Delta Sum] | Sum of system.diskio.read.count_delta of individual disks | cnt |
| Disk Read Requests [Success Delta] | Individual disk’s system.diskio.read.count delta | cnt |
| Disk Read Requests [Success] | Total number of successful read completions | cnt |
| Disk Request Size [Avg] | The average size of requests executed for the device (unit: sector) | num |
| Disk Service Time [Avg] | The average service time (in milliseconds) for input requests executed on the device. | ms |
| Disk Wait Time [Avg] | The average time spent on requests executed for supported devices. | ms |
| Disk Wait Time [Read] | Disk Average Wait Time | ms |
| Disk Wait Time [Write] | Disk Average Wait Time | ms |
| Disk Write Bytes [Delta Avg] | Average of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Max] | Maximum of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Min] | Minimum of system.diskio.write.bytes_delta for each disk | bytes |
| Disk Write Bytes [Delta Sum] | The sum of system.diskio.write.bytes_delta of individual disks | bytes |
| Disk Write Bytes [Delta] | Delta value of system.diskio.write.bytes for each disk | bytes |
| Disk Write Bytes [Success] | The total number of bytes written successfully. Linux assumes a 512-byte sector size and reports the number of sectors written multiplied by 512 | bytes |
| Disk Write Requests | The number of write requests to the disk device for 1 second | cnt |
| Disk Write Requests [Delta Avg] | Average of system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Delta Max] | Maximum of system.diskio.write.count_delta for each disk | cnt |
| Disk Write Requests [Delta Min] | Minimum of system.diskio.write.count_delta for individual disks | cnt |
| Disk Write Requests [Delta Sum] | Sum of system.diskio.write.count_delta of individual disks | cnt |
| Disk Write Requests [Success Delta] | Individual disk’s system.diskio.write.count delta | cnt |
| Disk Write Requests [Success] | Total number of writes completed successfully | cnt |
| Disk Write Bytes | The number of bytes written to the device per second. | bytes |
| Filesystem Hang Check | Filesystem (local/NFS) hang check (normal: 1, abnormal: 0) | status |
| Filesystem Nodes | The total number of file nodes in the file system. | cnt |
| Filesystem Nodes [Free] | The total number of available file nodes in the file system. | cnt |
| Filesystem Size [Available] | Disk space (in bytes) available to unprivileged users | bytes |
| Filesystem Size [Free] | Available disk space (bytes) | bytes |
| Filesystem Size [Total] | Total Disk Space (bytes) | bytes |
| Filesystem Usage | Used Disk Space Percentage | % |
| Filesystem Usage [Avg] | Average of individual filesystem.used.pct | % |
| Filesystem Usage [Inode] | Inode usage rate | % |
| Filesystem Usage [Max] | Maximum value among individual filesystem usage percentages | % |
| Filesystem Usage [Min] | Minimum of individual filesystem used percentages | % |
| Filesystem Usage [Total] | - | % |
| Filesystem Used | Used Disk Space (bytes) | bytes |
| Filesystem Used [Inode] | Inode usage | bytes |
| Memory Free | The total amount of available memory (bytes). It does not include memory used by system cache and buffers (see system.memory.actual.free). | bytes |
| Memory Free [Actual] | Actual available memory (bytes). The calculation method varies depending on the OS, and in Linux, it is either MemAvailable from /proc/meminfo or calculated from available memory, cache, and buffer if meminfo is not available. On OSX, it is the sum of available memory and inactive memory. On Windows, it is the same as system.memory.free. | bytes |
| Memory Free [Swap] | Available swap memory. | bytes |
| Memory Total | Total Memory | bytes |
| Memory Total [Swap] | Total swap memory. | bytes |
| Memory Usage | Used memory percentage | % |
| Memory Usage [Actual] | The percentage of memory actually used | % |
| Memory Usage [Cache Swap] | Cache swap usage rate | % |
| Memory Usage [Swap] | Used swap memory percentage | % |
| Memory Used | Used Memory | bytes |
| Memory Used [Actual] | Actual used memory (bytes): total memory minus actually free memory. Available memory is calculated differently depending on the OS (refer to system.actual.free) | bytes |
| Memory Used [Swap] | Used swap memory. | bytes |
| Collisions | Network Collisions | cnt |
| Network In Bytes | Received byte count | bytes |
| Network In Bytes [Delta Avg] | Average of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Max] | Maximum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Min] | Minimum of system.network.in.bytes_delta for each network | bytes |
| Network In Bytes [Delta Sum] | Sum of each network’s system.network.in.bytes_delta | bytes |
| Network In Bytes [Delta] | Received byte count delta | bytes |
| Network In Dropped | The number of packets deleted among incoming packets | cnt |
| Network In Errors | Number of errors during reception | cnt |
| Network In Packets | Received packet count | cnt |
| Network In Packets [Delta Avg] | Average of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Max] | Maximum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Min] | Minimum of system.network.in.packets_delta for each network | cnt |
| Network In Packets [Delta Sum] | Sum of system.network.in.packets_delta of individual networks | cnt |
| Network In Packets [Delta] | Received packet count delta | cnt |
| Network Out Bytes | Transmitted byte count | bytes |
| Network Out Bytes [Delta Avg] | Average of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Max] | Maximum of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Min] | Minimum of system.network.out.bytes_delta for each network | bytes |
| Network Out Bytes [Delta Sum] | The sum of system.network.out.bytes_delta of individual networks | bytes |
| Network Out Bytes [Delta] | Transmitted byte count delta | bytes |
| Network Out Dropped | Number of packets dropped among outgoing packets. This value is not reported by the operating system, so it is always 0 in Darwin and BSD | cnt |
| Network Out Errors | Number of errors during transmission | cnt |
| Network Out Packets | Number of transmitted packets | cnt |
| Network Out Packets [Delta Avg] | Average of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Max] | Maximum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Min] | Minimum of system.network.out.packets_delta for each network | cnt |
| Network Out Packets [Delta Sum] | Sum of system.network.out.packets_delta of individual networks | cnt |
| Network Out Packets [Delta] | Number of transmitted packets delta | cnt |
| Open Connections [TCP] | All open TCP connections | cnt |
| Open Connections [UDP] | All open UDP connections | cnt |
| Port Usage | Port usage available for connection | % |
| SYN Sent Sockets | Number of sockets in SYN_SENT state (when connecting from local to remote) | cnt |
| Kernel PID Max | kernel.pid_max value | cnt |
| Kernel Thread Max | kernel threads-max value | cnt |
| Process CPU Usage | Percentage of CPU time consumed by the process after the last update. This value is similar to the %CPU value of the process displayed by the top command on Unix systems | % |
| Process CPU Usage/Core | Percentage of CPU time used by the process since the last event, normalized by the number of cores, with a value between 0~100% | % |
| Process Memory Usage | The percentage of main memory (RAM) occupied by the process | % |
| Process Memory Used | Resident Set size. The amount of memory a process occupies in RAM. In Windows, it is the current working set size | bytes |
| Process PID | Process PID | PID |
| Process PPID | Parent process’s pid | PID |
| Processes [Dead] | Number of dead processes | cnt |
| Processes [Idle] | Number of idle processes | cnt |
| Processes [Running] | Number of running processes | cnt |
| Processes [Sleeping] | Number of sleeping processes | cnt |
| Processes [Stopped] | Number of stopped processes | cnt |
| Processes [Total] | Total number of processes | cnt |
| Processes [Unknown] | Number of processes whose state is unknown or cannot be read | cnt |
| Processes [Zombie] | Number of zombie processes | cnt |
| Running Process Usage | Process usage rate | % |
| Running Processes | Number of running processes | cnt |
| Running Thread Usage | Thread usage rate | % |
| Running Threads | Number of threads running in running processes | cnt |
| Instance Status | Instance status | state |
| Context Switches | context switch count (per second) | cnt |
| Load/Core [1 min] | Load for the last 1 minute divided by the number of cores | cnt |
| Load/Core [15 min] | Load for the last 15 minutes divided by the number of cores | cnt |
| Load/Core [5 min] | Load for the last 5 minutes divided by the number of cores | cnt |
| Multipaths [Active] | External storage connection path status = active count | cnt |
| Multipaths [Failed] | External storage connection path status = failed count | cnt |
| Multipaths [Faulty] | External storage connection path status = faulty count | cnt |
| NTP Offset | Measured offset of the last sample (time difference between the NTP server and the local environment) | num |
| Run Queue Length | Execution Waiting Queue Length | num |
| Uptime | OS uptime (milliseconds) | ms |
| Context Switches | CPU context switch count (per second) | cnt |
| Disk Read Bytes [Sec] | Number of bytes read from the Windows logical disk in 1 second | cnt |
| Disk Read Time [Avg] | Average data read time (sec) | sec |
| Disk Transfer Time [Avg] | Average disk wait time | sec |
| Disk Usage | Disk usage rate | % |
| Disk Write Bytes [Sec] | Number of bytes written to the Windows logical disk in 1 second | cnt |
| Disk Write Time [Avg] | Average data write time (sec) | sec |
| Pagingfile Usage | Paging file usage rate | % |
| Pool Used [Non Paged] | Kernel memory Non-paged pool usage | bytes |
| Pool Used [Paged] | Paged Pool usage among kernel memory | bytes |
| Process [Running] | The number of processes currently running | cnt |
| Threads [Running] | Number of threads currently running | cnt |
| Threads [Waiting] | The number of threads waiting for processor time | cnt |
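Two conventions recur throughout the table: [Delta] items are differences of cumulative counters between two samples, and the "/Core" items are the unnormalized percentages divided by the core count. A minimal sketch of these assumed relationships (the function names here are illustrative, not metric names):

```python
# Sketch of the metric conventions above: [Delta] = counter difference
# between samples; "/Core" = unnormalized percentage / core count.
def counter_delta(curr: int, prev: int) -> int:
    """Delta of a cumulative counter (e.g. system.diskio.read.bytes)."""
    return curr - prev

def per_core_usage(unnormalized_pct: float, cores: int) -> float:
    """CPU Usage/Core: unnormalized usage divided by the number of cores."""
    return unnormalized_pct / cores

# 4 cores all fully busy: CPU Usage [Active] = 400%, per-core = 100%
print(per_core_usage(400.0, 4))      # 100.0
print(counter_delta(10_240, 8_192))  # 2048 bytes read since the last sample
```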
2 - How-to guides
Users can enter the required information for the Multi-node GPU Cluster service through the Samsung Cloud Platform Console, select detailed options, and create the service.
Multi-node GPU Cluster Getting Started
You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.
This service consists of GPU Node and Cluster Fabric services.
GPU Node Creation
To create a Multi-node GPU Cluster, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
- Click the GPU Node Creation button on the Service Home page. You will be taken to the GPU Node Creation page.
- On the GPU Node Creation page, enter the information required to create the service and select detailed options.
- In the Image and Version Selection area, select the required information.

| Category | Required | Detailed Description |
|---|---|---|
| Image | Required | Select the provided image type (Ubuntu) |
| Image Version | Required | Select the version of the chosen image; a list of provided server image versions is shown |

Table. GPU Node image and version selection items
- In the Service Information area, enter or select the required information.

| Category | Required | Detailed Description |
|---|---|---|
| Number of Servers | Required | Number of GPU Node servers to create simultaneously. Only numbers can be entered, and the minimum number of servers to create is 2. Two or more servers can be created only during the initial setup; afterwards, expansion is possible one server at a time. |
| Service Type > Server Type | Required | GPU Node server type. Select the desired CPU, memory, GPU, and disk specifications. For details on the server types provided by GPU Node, refer to Multi-node GPU Cluster Server Type. |
| Service Type > Planned Compute | Required | Status of resources with Planned Compute set. In Use: number of resources with Planned Compute currently in use. Configured: number of resources with Planned Compute set. Coverage Preview: amount applied per resource by Planned Compute. Planned Compute Service Application: go to the Planned Compute application page. For details, refer to Apply for Planned Compute. |

Table. GPU Node service information input items

- In the Required Information area, enter or select the required information.
| Category | Required | Detailed Description |
|---|---|---|
| Administrator Account | Required | Set the administrator account and password used when connecting to the server. For Ubuntu OS, the account is fixed to root. |
| Server Name Prefix | Required | Enter a prefix to distinguish each GPU Node created when two or more servers are selected. Names are generated automatically in the format of the entered prefix plus a '-###' suffix. The prefix must start with a lowercase English letter; may contain lowercase letters, numbers, and the special character '-'; must be 3 to 11 characters long; and must not end with '-'. |
| Network Settings | Required | Set the network where the GPU Node will be installed. VPC Name: select a pre-created VPC. General Subnet Name: select a pre-created general Subnet. IP: choose auto-generate or user input; if user input is selected, enter the IP directly. NAT: can be used only when there is one server and the VPC has an Internet Gateway attached; checking Use allows selection of a NAT IP (at initial creation at least two servers are created, so configure NAT on the resource detail page). NAT IP: select a NAT IP; if none is available, click the Create New button to create a Public IP, then click the Refresh button to view and select it. Creating a Public IP incurs charges according to the Public IP pricing policy. |

Table. GPU Node required information input items
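The Server Name Prefix rules above can be captured in a small validator. This is a hypothetical sketch of those rules (the regular expression and zero-padded '-###' suffix format are assumptions based on the description, not an official implementation):

```python
import re

# Hypothetical validator for the Server Name Prefix rules described above:
# starts with a lowercase letter; lowercase letters, digits, and '-' only;
# 3 to 11 characters; must not end with '-'.
_PREFIX_RE = re.compile(r"^[a-z][a-z0-9-]{1,9}[a-z0-9]$")

def is_valid_prefix(prefix: str) -> bool:
    return _PREFIX_RE.match(prefix) is not None

def server_name(prefix: str, index: int) -> str:
    """Generated names follow the prefix + '-###' format (assumed zero-padded)."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid prefix: {prefix!r}")
    return f"{prefix}-{index:03d}"

print(server_name("gpu-node", 1))  # gpu-node-001
```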
- In the Cluster Selection area, create or select a Cluster Fabric.

| Category | Required | Detailed Description |
|---|---|---|
| Cluster Fabric | Required | A group of GPU Node servers to which GPU Direct RDMA can be applied together. Optimal GPU performance and speed are guaranteed only within the same Cluster Fabric. To create a new Cluster Fabric, select New Input > Node Pool, then enter the name of the Cluster Fabric to create. To add to an existing Cluster Fabric, select Existing Input > Node Pool, then select the already created Cluster Fabric. |

Table. GPU Node Cluster Fabric selection items

- In the Additional Information area, enter or select the required information.

| Category | Required | Detailed Description |
|---|---|---|
| Lock | Optional | Using Lock prevents accidental actions that terminate, start, or stop the server |
| Init Script | Optional | Script to run when the server starts. The Init Script type depends on the image type; for Linux, select Shell Script or cloud-init. |
| Tag | Optional | Add tags. Up to 50 tags can be added per resource. After clicking the Add Tag button, enter or select Key and Value. |

Table. GPU Node additional information input items
- Check the detailed information and estimated billing amount shown in the Summary panel, and click the Complete button.
- Once creation is complete, check the created resources on the GPU Node List page.
- When the service is created, the GPU MIG/ECC settings are reset. To ensure the correct settings are applied, perform a one-time reboot after initial creation and verify that the settings have been applied before use.
- For details about the GPU MIG/ECC settings reset, refer to the GPU MIG/ECC Settings Reset Checklist Guide.
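One way to verify the MIG/ECC state after the one-time reboot is to parse `nvidia-smi` query output. In this sketch, `mig.mode.current` and `ecc.mode.current` are real `nvidia-smi --query-gpu` fields, but the expected modes and the sample output string are illustrative assumptions; run the command on the node and check against the settings your workload requires.

```python
# Sketch: confirm MIG/ECC modes from nvidia-smi CSV output, e.g. from
#   nvidia-smi --query-gpu=mig.mode.current,ecc.mode.current --format=csv,noheader
# The sample string below is illustrative (two GPUs shown).
SAMPLE_OUTPUT = """\
Disabled, Enabled
Disabled, Enabled
"""

def settings_applied(csv_output: str, want_mig: str = "Disabled",
                     want_ecc: str = "Enabled") -> bool:
    """True if every GPU line reports the expected MIG and ECC modes."""
    for line in csv_output.strip().splitlines():
        mig, ecc = (field.strip() for field in line.split(","))
        if mig != want_mig or ecc != want_ecc:
            return False
    return True

print(settings_applied(SAMPLE_OUTPUT))  # True
```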
Check GPU Node Detailed Information
The Multi-node GPU Cluster service allows you to view and edit the full list of GPU Node resources and their detailed information.
The GPU Node Details page consists of the Details, Tags, and Job History tabs.
To view detailed information of the GPU Node, follow the steps below.
Click the All Services > Compute > Multi-node GPU Cluster > GPU Node menu. You will be taken to the Multi-node GPU Cluster Service Home page.
Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- Resource items other than the required columns can be added via the Settings button.

| Category | Required | Detailed Description |
|---|---|---|
| Resource ID | Optional | ID of the user-created GPU Node |
| Cluster Fabric Name | Required | Name of the Cluster Fabric created by the user |
| Server Name | Required | Name of the user-created GPU Node |
| Server Type | Required | Server type of the GPU Node; shows the number of cores, memory capacity, and GPU type and count of the created resource |
| Image | Required | Image version of the user-created GPU Node |
| IP | Required | IP of the user-created GPU Node |
| Status | Required | Status of the user-created GPU Node |
| Creation Time | Optional | GPU Node creation time |

Table. GPU Node resource list items
Click a resource on the GPU Node List page to view its detailed information. You will be taken to the GPU Node Details page.
- At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.

| Category | Detailed Description |
|---|---|
| GPU Node Status | Status of the GPU Node created by the user. Creating: the server is being created. Running: creation is complete and the server is usable. Editing: the IP is being changed. Unknown: error state. Starting: the server is starting. Stopping: the server is stopping. Stopped: the server has stopped. Terminating: termination is in progress. Terminated: termination is complete. |
| Server Control | Buttons to change the server status. Start: start a stopped server. Stop: stop a running server. |
| Service Cancellation | Button to cancel the service |

Table. GPU Node status information and additional features
Detailed Information
On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and, if necessary, edit it.
| Category | Detailed description |
|---|---|
| Service | Service Name |
| Resource Type | Resource Type |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource Name | Resource Name |
| Resource ID | Unique resource ID in the service |
| Creator | User who created the service |
| Creation time | Service creation time |
| Modifier | User who edited the service information |
| Modification date/time | Date and time when the service information was modified |
| Server name | Server name |
| Node pool | A group of nodes that can be grouped together with the same Cluster Fabric |
| Cluster Fabric name | Cluster Fabric name created by the user |
| Image/Version | Server’s OS image and version |
| Server Type | Displays CPU, memory, and GPU information |
| Planned Compute | Status of resources with Planned Compute set |
| Lock | Displays lock usage status |
| Network | GPU Node network information |
| Block Storage | Block Storage information connected to the server |
| Init Script | View the Init Script content entered when creating the server |
Tags
On the Tags tab of the GPU Node Details page, you can view the tag information of the selected resource and add, modify, or delete tags.
| Category | Detailed description |
|---|---|
| Tag List | Tag List |
Job History
The Job History tab of the GPU Node Details page allows you to view the job history of the selected resource.
| Category | Detailed description |
|---|---|
| Job History List | Resource change history |
GPU Node Operation Control
If you need server control and management functions for the created GPU Node resources, you can perform these tasks on the GPU Node List or GPU Node Details page. You can start and stop GPU Node resources.
Start GPU Node
You can start a stopped GPU Node. To start a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Multi-node GPU Cluster Service Home page.
- Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, select individual or multiple servers with the checkboxes, then click Start via the More button at the top.
- Alternatively, click a resource on the GPU Node List page to open the GPU Node Details page.
- On the GPU Node Details page, click the Start button at the top to start the server.
- Check the server status to confirm the status change is complete.
Stopping a GPU Node
You can stop a running GPU Node. To stop a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
- Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, select individual or multiple servers with the checkboxes, then click the Stop button at the top.
- Alternatively, click a resource on the GPU Node List page to open the GPU Node Details page, then click the Stop button at the top to stop the server.
- Check the server status to confirm that the status change is complete.
Terminating a GPU Node
You can terminate unused GPU Nodes to reduce operating costs. However, terminating the service may immediately stop any running workloads, so consider the impact of a service interruption carefully before proceeding.
To terminate a GPU Node, follow the steps below.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
- Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
- On the Cluster Fabric List page, select the resource to terminate and click the Cancel Service button.
- Resources using the same Cluster Fabric can be terminated simultaneously.
- Once termination is complete, check on the GPU Node List page whether the resources have been terminated.
GPU Node termination is not possible in the following cases.
- Block Storage (BM) is connected: disconnect the Block Storage (BM) first.
- File Storage is connected: disconnect the File Storage first.
- Lock is set: change the Lock setting to unused and try again.
- The selection includes a server that cannot be terminated simultaneously: re-select only the resources that can be terminated.
- The servers you want to terminate use different Cluster Fabrics: select only resources that use the same Cluster Fabric.
2.1 - Cluster Fabric Management
Cluster Fabric is a service that helps manage servers (GPU Nodes) included in a GPU Cluster. Using Cluster Fabric, you can move servers between GPU Clusters in the same Node pool and optimize the performance and speed of GPUs within the same GPU Cluster.
Creating Cluster Fabric
Cluster Fabric can be created together with a GPU Node, and it cannot be created or deleted separately. When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. If you haven’t created a GPU Node, please create one first. For more information, refer to Creating a GPU Node.
Checking Cluster Fabric Details
You can check the created Cluster Fabric list and details, and move servers on the Cluster Fabric List page and Cluster Fabric Details page.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of the Multi-node GPU Cluster.
- Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
- On the Cluster Fabric List page, you can view the list of GPU Cluster resources created by the user.
- Resource items other than the required columns can be added through the Settings button.

| Category | Required | Description |
|---|---|---|
| Resource ID | Optional | Cluster Fabric ID created by the user |
| Cluster Fabric Name | Required | Cluster Fabric name created by the user |
| Node Pool | Optional | A collection of nodes that can be bundled into the same Cluster Fabric |
| Number of Servers | Optional | Number of GPU Nodes |
| Server Type | Optional | Server type of the GPU Node. The user can check the number of cores, memory capacity, and the GPU type and count of the created resource |
| Status | Optional | Status of the Cluster Fabric created by the user |
| Creation Time | Optional | Time when the Cluster Fabric was created |

Table. Cluster Fabric resource list items
- Click a resource on the Cluster Fabric List page to check its details. You will be taken to the Cluster Fabric Details page.
- At the top of the Cluster Fabric Details page, status information and additional feature descriptions are displayed.

| Category | Description |
|---|---|
| Cluster Fabric Status | Status of the Cluster Fabric created by the user. Creating: cluster creation in progress / Active: creation completed and available / Editing: IP change in progress / Deleting: termination in progress / Deleted: termination completed |
| Add Target Server | Function to move a server from another cluster to this cluster |

Table. Cluster Fabric status information and additional features
Details
On the Details tab of the Cluster Fabric List page, you can check the details of the selected resource and bring in servers from other clusters.
| Category | Description |
|---|---|
| Service | Service category |
| Resource Type | Service name |
| SRN | Unique resource ID in Samsung Cloud Platform |
| Resource Name | Resource name |
| Resource ID | Unique resource ID in the service |
| Creator | User who created the service |
| Creation Time | Time when the service was created |
| Modifier | User who modified the service information |
| Modification Time | Time when the service information was modified |
| Cluster Fabric Name | Cluster Fabric name created by the user |
| Node Pool | A collection of nodes that can be bundled into the same Cluster Fabric |
| Target Server | List of GPU Nodes bound to the Cluster Fabric |
Bringing in Cluster Fabric Servers
Using the Add Target Server feature on the Cluster Fabric Details page, you can bring in servers from other clusters and add them to the selected cluster.
- Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of the Multi-node GPU Cluster.
- Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
- Click a resource on the Cluster Fabric List page to check its details. You will be taken to the Cluster Fabric Details page.
- Click the Add button on the right side of Target Server on the Details tab. The Add Target Server popup window opens.
- Select a cluster in Cluster Fabric.
- The GPU Nodes bound to the selected cluster are retrieved, and you can select the GPU Node to bring in.
- The selected GPU Node is listed at the bottom by GPU Node name.
- Click the Confirm button to complete, or click the Cancel button to cancel the task.
- Check that the added GPU Node appears in the Target Server list.
Terminating Cluster Fabric
When all GPU Nodes within a Cluster Fabric are terminated, the Cluster Fabric is automatically deleted. For more information, refer to Terminating a GPU Node.
2.2 - ServiceWatch Agent Install
Users can install the ServiceWatch Agent on the GPU node of a Multi-node GPU Cluster to collect custom metrics and logs.
ServiceWatch Agent
The agents that need to be installed on the GPU Nodes of a Multi-node GPU Cluster to collect ServiceWatch custom metrics and logs can be broadly divided into two types: the Prometheus Exporter and the OpenTelemetry Collector.
| Category | Detailed description |
|---|---|
| Prometheus Exporter | Provides metrics of a specific application or service in a format that Prometheus can scrape |
| OpenTelemetry Collector | Acts as a centralized collector that gathers telemetry data such as metrics and logs from distributed systems, processes them (filtering, sampling, etc.), and then exports them to various backends (e.g., Prometheus, Jaeger, Elasticsearch) |
If you have configured Kubernetes Engine on a GPU Node, please check GPU metrics through the metrics provided by Kubernetes Engine.
- If you install the DCGM Exporter on a GPU node where Kubernetes Engine is configured, it may not work properly.
2.3 - Multi-node GPU Cluster Service Scope and Inspection Guide
Multi-node GPU Cluster service scope
In the event of an IaaS HW-level issue with the Multi-node GPU Cluster service, technical support can be obtained through the Support Center's Contact Us. However, risks arising from changes such as OS kernel updates or application installation are the responsibility of the user, so technical support may be difficult to provide; please be cautious when performing system updates or other such tasks.
IaaS HW-level problems
- An internal HW fault event message for the server appears on the IPMI (iLO) HW monitoring console
- A GPU HW operation error is confirmed with the nvidia-smi command
- HW error messages occur during inspection of the InfiniBand HCA card or InfiniBand Switch
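Before opening a ticket, it can help to confirm that kernel logs actually contain one of the HW fault signatures above. The following is a minimal, hypothetical sketch: the log file and its contents are sample data, and on a real node you would pipe `dmesg` (or an IPMI SEL dump) into the same filter.

```shell
#!/bin/sh
# Sketch: filter kernel-log text for common IaaS HW fault signatures
# (NVIDIA Xid events, generic HW errors, mlx5/InfiniBand errors).
scan_hw_errors() {
    grep -E -i 'NVRM: Xid|Hardware Error|mlx5.*err|ib.*link.*down' "$1"
}

# Illustrative sample log; replace with `dmesg > /tmp/kern.log` on a real node.
cat > /tmp/sample_kern.log <<'EOF'
[1000.1] usb 1-1: new high-speed USB device
[2000.2] NVRM: Xid (PCI:0000:45:00): 79, GPU has fallen off the bus.
[3000.3] mlx5_core 0000:0e:00.0: health compl failed, err -110
EOF

scan_hw_errors /tmp/sample_kern.log
```

Any matching lines are candidates for an IaaS HW-level support request; an empty result suggests the issue is more likely in the OS or application layer, which falls under user responsibility.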
IaaS HW Inspection Guide
After applying for the Multi-node GPU Cluster service, it is recommended to check the IaaS HW level according to the inspection guide.
OS Kernel and Package Holding
- If you do not want package versions to be updated automatically, it is recommended to block package updates using the apt-mark command.
- It is recommended to block updates of the Linux kernel and IB-related package versions.
To proceed with OS Kernel and Package holding, follow the procedure below.
- Use the following commands to check the versions of the kernel and IB-related packages.

```
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | mlnx"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | nvidia"
root@bm-dev-001:~# dpkg -l | egrep -i "kernel | linux-image"
ii  crash                                  7.2.8-1ubuntu1.20.04.1          amd64  kernel debugging utility, allowing gdb like syntax
ii  dkms                                   2.8.1-5ubuntu2                  all    Dynamic Kernel Module Support Framework
ii  dmeventd                               2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper event daemon
ii  dmsetup                                2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper userspace library
ii  iser-dkms                              5.4-OFED.5.4.3.0.1.1            all    DKMS support fo iser kernel modules
ii  isert-dkms                             5.4-OFED.5.4.3.0.1.1            all    DKMS support fo isert kernel modules
ii  kernel-mft-dkms                        4.17.2-12                       all    DKMS support for kernel-mft kernel modules
ii  kmod                                   27-1ubuntu2                     amd64  tools for managing Linux kernel modules
ii  knem                                   1.1.4.90mlnx1-OFED.5.1.2.5.0.1  amd64  userspace tools for the KNEM kernel module
ii  knem-dkms                              1.1.4.90mlnx1-OFED.5.1.2.5.0.1  all    DKMS support for mlnx-ofed kernel modules
ii  libaio1:amd64                          0.3.112-5                       amd64  Linux kernel AIO access library - shared library
ii  libdevmapper-event1.02.1:amd64         2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper event support library
ii  libdevmapper1.02.1:amd64               2:1.02.167-1ubuntu1             amd64  Linux Kernel Device Mapper userspace library
ii  libdrm-amdgpu1:amd64                   2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii  libdrm-common                          2.4.107-8ubuntu1~20.04.2        all    Userspace interface to kernel DRM services -- common files
ii  libdrm-intel1:amd64                    2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to intel-specific kernel DRM services -- runtime
ii  libdrm-nouveau2:amd64                  2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to nouveau-specific kernel DRM services -- runtime
ii  libdrm-radeon1:amd64                   2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm2:amd64                          2.4.107-8ubuntu1~20.04.2        amd64  Userspace interface to kernel DRM services -- runtime
ii  linux-firmware                         1.187.29                        all    Firmware for Linux kernel drivers
hi  linux-generic                          5.4.0.105.109                   amd64  Complete Generic Linux kernel and headers
ii  linux-headers-5.4.0-104                5.4.0-104.118                   all    Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-104-generic        5.4.0-104.118                   amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
ii  linux-headers-5.4.0-105                5.4.0-105.119                   all    Header files related to Linux kernel version 5.4.0
ii  linux-headers-5.4.0-105-generic        5.4.0-105.119                   amd64  Linux kernel headers for version 5.4.0 on 64 bit x86 SMP
hi  linux-headers-generic                  5.4.0.105.109                   amd64  Generic Linux kernel headers
ii  linux-image-5.4.0-104-generic          5.4.0-104.118                   amd64  Signed kernel image generic
ii  linux-image-5.4.0-105-generic          5.4.0-105.119                   amd64  Signed kernel image generic
hi  linux-image-generic                    5.4.0.105.109                   amd64  Generic Linux kernel image
ii  linux-libc-dev:amd64                   5.4.0-105.119                   amd64  Linux Kernel Headers for development
ii  linux-modules-5.4.0-104-generic        5.4.0-104.118                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-5.4.0-105-generic        5.4.0-105.119                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-104-generic  5.4.0-104.118                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  linux-modules-extra-5.4.0-105-generic  5.4.0-105.119                   amd64  Linux kernel extra modules for version 5.4.0 on 64 bit x86 SMP
ii  mlnx-ofed-kernel-dkms                  5.4-OFED.5.4.3.0.3.1            all    DKMS support for mlnx-ofed kernel modules
ii  mlnx-ofed-kernel-utils                 5.4-OFED.5.4.3.0.3.1            amd64  Userspace tools to restart and tune mlnx-ofed kernel modules
ii  mlnx-tools                             5.2.0-0.54303                   amd64  Userspace tools to restart and tune MLNX_OFED kernel modules
ii  nvidia-kernel-common-470               470.103.01-0ubuntu0.20.04.1     amd64  Shared files used with the kernel module
ii  nvidia-kernel-source-470               470.103.01-0ubuntu0.20.04.1     amd64  NVIDIA kernel source package
ii  nvidia-peer-memory                     1.2-0                           all    nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms                1.2-0                           all    DKMS support for nvidia-peer-memory kernel modules
ii  rsyslog                                8.2001.0-1ubuntu1.1             amd64  reliable system and kernel logging daemon
ii  srp-dkms                               5.4-OFED.5.4.3.0.1.1            all    DKMS support fo srp kernel modules
```

Code block. Kernel and IB-related package version check

- Use the apt-mark command to hold the package update.

```
# apt-mark hold <package name>
```

Code block. Package update hold
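As a convenience, the package names to hold can be extracted from the `dpkg -l` listing itself rather than typed one by one. The sketch below is a hypothetical helper: it parses a small sample of such a listing and prints the corresponding `apt-mark hold` commands. On a real node you would feed the actual `dpkg -l | egrep ...` output into the same pipeline, run the printed commands, and verify the result with `apt-mark showhold`.

```shell
#!/bin/sh
# Sketch: turn a dpkg -l listing into the matching `apt-mark hold` commands.
hold_commands() {
    # In "ii <name> <version> <arch> <desc>" rows, column 2 is the package
    # name; "hi" rows are already on hold but re-holding them is harmless.
    awk '$1 == "ii" || $1 == "hi" { print "apt-mark hold " $2 }' "$1"
}

# Illustrative sample mimicking the listing above.
cat > /tmp/sample_dpkg.txt <<'EOF'
ii  mlnx-ofed-kernel-dkms  5.4-OFED.5.4.3.0.3.1  all    DKMS support for mlnx-ofed kernel modules
hi  linux-image-generic    5.4.0.105.109         amd64  Generic Linux kernel image
EOF

hold_commands /tmp/sample_dpkg.txt
```

The commands are printed rather than executed so you can review them before applying holds on a production node.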
Intel E810 Driver Update
Check the version of the Intel E810 driver and update it to the recommended version.
- Server manufacturer Intel E810 driver recommended version: 1.15.4
- Driver (ice-1.15.4.tar.gz) download
The driver update procedure is as follows.
- Move the driver tar file to the desired directory.
Example: /home/username/ice or /usr/local/src/ice
- Extract the archive file. x.x.x is the version number of the driver tar file.

```
tar zxf ice-x.x.x.tar.gz
```

Code block. File extraction

- Change to the driver src directory. x.x.x is the version number of the driver tar file.

```
cd ice-x.x.x/src/
```

Code block. Directory change

- Compile and install the driver module.

```
make install
```

Code block. Driver module compile

- After the update is complete, check the version.

```
lsmod | grep ice
modinfo ice | grep version
```

Code block. Version check
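After updating, the `modinfo` version line can be compared against the recommended version mechanically, which is handy when checking many nodes. A minimal sketch, assuming the recommended version 1.15.4 stated above; the sample `modinfo` line is illustrative, and on a real node you would pass in `modinfo ice | grep '^version:'`.

```shell
#!/bin/sh
# Sketch: compare the installed ice driver version with the recommended one.
RECOMMENDED="1.15.4"

check_ice_version() {
    # $1: a "version: x.y.z" line as printed by `modinfo ice | grep version`
    installed=$(printf '%s\n' "$1" | awk '/^version:/ { print $2 }')
    if [ "$installed" = "$RECOMMENDED" ]; then
        echo "OK: ice $installed matches recommended $RECOMMENDED"
    else
        echo "UPDATE NEEDED: installed=$installed recommended=$RECOMMENDED"
    fi
}

# On a real node: check_ice_version "$(modinfo ice | grep '^version:')"
check_ice_version "version:        1.15.4"
```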
NVIDIA driver check
To check the NVIDIA driver (nvidia-smi topo, IB nv_peer_mem status) and inspect the IaaS HW level, follow the procedure below.
- Check the GPU driver and HW status.

```
user@bm-dev-001:~$ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx5_0  mlx5_1  mlx5_2  mlx5_3  CPU Affinity  NUMA Affinity
GPU0    X     NV12  NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     48-63         3
GPU1    NV12  X     NV12  NV12  NV12  NV12  NV12  NV12  SYS     PXB     SYS     SYS     48-63         3
GPU2    NV12  NV12  X     NV12  NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     16-31         1
GPU3    NV12  NV12  NV12  X     NV12  NV12  NV12  NV12  PXB     SYS     SYS     SYS     16-31         1
GPU4    NV12  NV12  NV12  NV12  X     NV12  NV12  NV12  SYS     SYS     SYS     PXB     112-127       7
GPU5    NV12  NV12  NV12  NV12  NV12  X     NV12  NV12  SYS     SYS     SYS     PXB     112-127       7
GPU6    NV12  NV12  NV12  NV12  NV12  NV12  X     NV12  SYS     SYS     PXB     SYS     80-95         5
GPU7    NV12  NV12  NV12  NV12  NV12  NV12  NV12  X     SYS     SYS     PXB     SYS     80-95         5
mlx5_0  SYS   SYS   PXB   PXB   SYS   SYS   SYS   SYS   X       SYS     SYS     SYS
mlx5_1  PXB   PXB   SYS   SYS   SYS   SYS   SYS   SYS   SYS     X       SYS     SYS
mlx5_2  SYS   SYS   SYS   SYS   SYS   SYS   PXB   PXB   SYS     SYS     X       SYS
mlx5_3  SYS   SYS   SYS   SYS   PXB   PXB   SYS   SYS   SYS     SYS     SYS     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Code block. GPU driver and HW status check

- Check the NVSwitch HW status.

```
user@bm-dev-001:~$ nvidia-smi nvlink --status
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-2c0d1d6b-e348-55fc-44cf-cd65a954b36c)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-96f429d8-893a-a9ea-deca-feffd90669e9)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 2: NVIDIA A100-SXM4-80GB (UUID: GPU-2e601952-b442-b757-a035-725cd320f589)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 3: NVIDIA A100-SXM4-80GB (UUID: GPU-bcbfd885-a9f8-ec8c-045b-c521472b4fed)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 4: NVIDIA A100-SXM4-80GB (UUID: GPU-30273090-2d78-fc7a-a360-ec5f871dd488)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 5: NVIDIA A100-SXM4-80GB (UUID: GPU-5ce7ef61-56dd-fb18-aa7c-be610c8d51c3)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 6: NVIDIA A100-SXM4-80GB (UUID: GPU-740a527b-b286-8b85-35eb-b6b41c0bb6d7)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
GPU 7: NVIDIA A100-SXM4-80GB (UUID: GPU-1fb6de95-60f6-dbf2-ffca-f7680577e37c)
	 Link 0: 25 GB/s   Link 1: 25 GB/s   Link 2: 25 GB/s   Link 3: 25 GB/s
	 Link 4: 25 GB/s   Link 5: 25 GB/s   Link 6: 25 GB/s   Link 7: 25 GB/s
	 Link 8: 25 GB/s   Link 9: 25 GB/s   Link 10: 25 GB/s  Link 11: 25 GB/s
```

Code block. NVSwitch HW status check

- Check the InfiniBand (IB) HCA card HW status and link.

```
user@bm-dev-001:~$ ibdev2netdev -v
cat: /sys/class/infiniband/mlx5_0/device/vpd: Permission denied
0000:45:00.0 mlx5_0 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs18 (Down)
cat: /sys/class/infiniband/mlx5_1/device/vpd: Permission denied
0000:0e:00.0 mlx5_1 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs17 (Down)
cat: /sys/class/infiniband/mlx5_2/device/vpd: Permission denied
0000:c5:00.0 mlx5_2 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs20 (Down)
cat: /sys/class/infiniband/mlx5_3/device/vpd: Permission denied
0000:85:00.0 mlx5_3 (MT4123 - ) fw 20.29.1016 port 1 (ACTIVE) ==> ibs19 (Down)
user@bm-dev-001:~$
root@bm-dev-001:~# ibstat
CA 'mlx5_0'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.29.1016
	Hardware version: 0
	Node GUID: 0x88e9a4ffff5060ac
	System image GUID: 0x88e9a4ffff5060ac
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 8
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x88e9a4ffff5060ac
		Link layer: InfiniBand
CA 'mlx5_1'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.29.1016
	Hardware version: 0
	Node GUID: 0x88e9a4ffff504080
	System image GUID: 0x88e9a4ffff504080
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 5
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x88e9a4ffff504080
		Link layer: InfiniBand
CA 'mlx5_2'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.29.1016
	Hardware version: 0
	Node GUID: 0x88e9a4ffff505038
	System image GUID: 0x88e9a4ffff505038
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 2
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x88e9a4ffff505038
		Link layer: InfiniBand
CA 'mlx5_3'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.29.1016
	Hardware version: 0
	Node GUID: 0x88e9a4ffff504094
	System image GUID: 0x88e9a4ffff504094
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 200
		Base lid: 7
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651e848
		Port GUID: 0x88e9a4ffff504094
		Link layer: InfiniBand
```

Code block. InfiniBand (IB) HCA card HW status and link check
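Across several nodes it is easy to miss one degraded port in this output, so the `ibstat` result can also be checked mechanically. The following is a minimal, hypothetical sketch: it flags the node as degraded unless every port is Active, LinkUp, and at rate 200 (the healthy values shown above). The embedded sample is abbreviated illustrative data; on a real node you would run it against `ibstat` output captured to a file.

```shell
#!/bin/sh
# Sketch: verify every ibstat port is Active / LinkUp at the expected rate.
check_ibstat() {
    ports=$(grep -c 'Port [0-9]*:' "$1")
    active=$(grep -c 'State: Active' "$1")
    linkup=$(grep -c 'Physical state: LinkUp' "$1")
    rate=$(grep -c 'Rate: 200' "$1")
    if [ "$ports" -gt 0 ] && [ "$active" -eq "$ports" ] \
       && [ "$linkup" -eq "$ports" ] && [ "$rate" -eq "$ports" ]; then
        echo "HEALTHY: $ports ports Active/LinkUp @200"
    else
        echo "DEGRADED: ports=$ports active=$active linkup=$linkup rate200=$rate"
    fi
}

# Abbreviated sample; on a real node: ibstat > /tmp/ibstat.txt
cat > /tmp/sample_ibstat.txt <<'EOF'
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
CA 'mlx5_1'
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
EOF

check_ibstat /tmp/sample_ibstat.txt
```

A DEGRADED result on real output is a candidate for an IaaS HW-level support request.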
IB bandwidth communication check
To check the IB bandwidth communication status (ib_send_bw) and inspect the IaaS HW level, follow these steps.
Check the name of the IB HCA interface.
Color modeuser@bm-dev-001:~$ ibdev2netdev mlx5_0 port 1 ==> ibs18 (Down) mlx5_1 port 1 ==> ibs17 (Down) mlx5_2 port 1 ==> ibs20 (Down) mlx5_3 port 1 ==> ibs19 (Down)user@bm-dev-001:~$ ibdev2netdev mlx5_0 port 1 ==> ibs18 (Down) mlx5_1 port 1 ==> ibs17 (Down) mlx5_2 port 1 ==> ibs20 (Down) mlx5_3 port 1 ==> ibs19 (Down)Code block. Check the name of IB HCA interface Check the HCA interface that can communicate with IB Switch#1.
```
mlx5_0 port 1 ==> ibs18 (Down)
mlx5_2 port 1 ==> ibs20 (Down)
```
Code block. HCA interface check

Check the HCA interfaces that communicate with IB Switch#2.
```
mlx5_1 port 1 ==> ibs17 (Down)
mlx5_3 port 1 ==> ibs19 (Down)
```
Code block. HCA interface check

Use the SERVER Side command to check the communication status.
The CLIENT Side command is entered second for mutual communication.

```
user@bm-dev-001:~$ ib_send_bw -d mlx5_3 -i 1 -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x07 QPN 0x002e PSN 0xa86622
 remote address: LID 0x0a QPN 0x002d PSN 0xfc58dd
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               19827.40             0.317238
---------------------------------------------------------------------------------------
```
Code block. Communication status check (SERVER Side)
Use the CLIENT Side command to check the communication status. The SERVER Side command is entered first for mutual communication.

```
root@bm-dev-003:~# ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP>
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_3
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x0a QPN 0x002a PSN 0x98a48e
 remote address: LID 0x07 QPN 0x002c PSN 0xe68304
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             19008.49           19006.37             0.304102
---------------------------------------------------------------------------------------
```
Code block. Communication status check (CLIENT Side)
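When the test is scripted across many HCAs, the bandwidth result can be extracted and compared against a minimum automatically. The sketch below reads saved `ib_send_bw` output from stdin and checks the "BW average[MB/sec]" column of the 65536-byte result row; the 19000 MB/sec default threshold and the `check_ib_bw` name are illustrative examples, not platform requirements.

```shell
# Sketch: extract the BW average from ib_send_bw output (piped on stdin)
# and fail if it is below a minimum threshold in MB/sec.
check_ib_bw() {
    min=${1:-19000}   # example threshold; tune for your fabric
    awk -v min="$min" '
        # The result row starts with the message size (65536); the average
        # bandwidth is the 4th column.
        $1 == "65536" { avg = $4 }
        END {
            if (avg == "") { print "no result row found"; exit 1 }
            printf "BW average: %s MB/sec\n", avg
            if (avg + 0 < min + 0) exit 1
        }
    '
}
```

Usage (CLIENT side, assumption: same options as above): `ib_send_bw -d mlx5_3 -i 1 -F <SERVER Side IP> | check_ib_bw 18000`.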
Check IB Service Related Kernel Modules
Check the IB service-related kernel modules (lsmod) to inspect the IaaS HW level.
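Instead of eyeballing the `lsmod` listings, the required modules can be verified with a short helper. This is a sketch only: the module list mirrors the output shown in this guide, and `check_ib_modules` is an illustrative name.

```shell
# Sketch: verify that the GPUDirect RDMA / IB kernel modules shown in this
# guide are loaded. Reads `lsmod` output on stdin.
check_ib_modules() {
    required="nv_peer_mem ib_core ib_ipoib mlx5_ib mlx5_core nvidia"
    loaded=$(awk 'NR > 1 { print $1 }')   # skip the "Module Size Used by" header
    missing=""
    for mod in $required; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || missing="$missing $mod"
    done
    if [ -n "$missing" ]; then
        echo "missing kernel modules:$missing"
        return 1
    fi
    echo "all required kernel modules loaded"
}
```

Usage on a live node: `lsmod | check_ib_modules`.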
```
user@bm-dev-001:~$ lsmod | grep nv_peer_mem
nv_peer_mem            16384  0
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia              35315712  156 nvidia_uvm,nv_peer_mem,nvidia_modeset
```

```
user@bm-dev-001:~$ service nv_peer_mem status
nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
   Loaded: loaded (/etc/init.d/nv_peer_mem; generated)
   Active: active (exited) since Mon 2023-03-13 16:21:33 KST; 2 days ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4913 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
```

```
user@bm-dev-001:~$ lsmod | grep ib
libiscsi_tcp           32768  1 iscsi_tcp
libiscsi               57344  2 libiscsi_tcp,iscsi_tcp
scsi_transport_iscsi  110592  4 libiscsi_tcp,iscsi_tcp,libiscsi
ib_ipoib              131072  0
ib_cm                  57344  2 rdma_cm,ib_ipoib
ib_umad                24576  8
mlx5_ib               380928  0
ib_uverbs             135168  18 rdma_ucm,mlx5_ib
ib_core               315392  9 rdma_cm,ib_ipoib,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
libcrc32c              16384  2 btrfs,raid456
mlx5_core            1458176  1 mlx5_ib
auxiliary              16384  2 mlx5_ib,mlx5_core
mlx_compat             65536  12 rdma_cm,ib_ipoib,mlxdevm,iw_cm,auxiliary,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
```

Storage Physical Disk Resources and Multi-Path Check
To inspect the IaaS HW level, check the server's physical disk resources and multi-path status.
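Once `multipath -ll` output is available, its health can be screened automatically. The sketch below reads the listing from stdin, counts path lines (identified by the H:C:T:L tuple), and fails if any path is not "active ready". It assumes the standard device-mapper-multipath listing format; the `check_multipath` name is illustrative.

```shell
# Sketch: count `multipath -ll` path lines (piped on stdin) and fail if any
# path is not in the "active ready" state.
check_multipath() {
    awk '
        # Path lines carry an H:C:T:L tuple such as " 1:0:0:1 ".
        / [0-9]+:[0-9]+:[0-9]+:[0-9]+ / {
            total++
            if ($0 !~ /active ready/) bad++
        }
        END {
            printf "%d paths, %d unhealthy\n", total + 0, bad + 0
            if (bad > 0 || total == 0) exit 1
        }
    '
}
```

Usage: `multipath -ll | check_multipath`.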
```
root@bm-dev-002:/tmp# fdisk -l
root@bm-dev-002:/tmp# multipath -ll
```
Code block. Disk and multi-path check commands

Service Network Check after a New Multi-node GPU Cluster Deployment
Use the following command to check that the MII Status of the bonding interface and each slave interface is up.
Command

```
root@mngc-001:~# cat /proc/net/bonding/bond-srv
Ethernet Channel Bonding Driver: v5.15.0-25-generic
```
Code block. Service network check command

Confirmation result

```
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: ens9f0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ens9f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:35:70
Slave queue ID: 0

Slave Interface: ens11f0
MII Status: up
Speed: 100000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 30:3e:a7:02:2f:e8
Slave queue ID: 0
```
Code block. Service network check result
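The bonding check can be automated by counting the "MII Status" lines (one for the bond itself and one per slave) and failing if any of them is not "up". A minimal sketch, with `check_bond` as an illustrative name:

```shell
# Sketch: parse /proc/net/bonding/<bond> output (piped on stdin) and confirm
# every "MII Status" line reports "up".
check_bond() {
    awk -F': *' '
        /^MII Status:/ { total++; if ($2 != "up") down++ }
        END {
            printf "%d MII Status lines, %d down\n", total + 0, down + 0
            if (down > 0 || total == 0) exit 1
        }
    '
}
```

Usage: `check_bond < /proc/net/bonding/bond-srv`.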
Time Server and Time Synchronization Check after a New Multi-node GPU Cluster Deployment
The OS image ships with the chrony daemon installed and configured to synchronize with the SCP NTP server. Use the following command to check that a line marked with ^* appears in the MS Name column.
Command

```
root@mngc-001:~# chronyc sources -v
```
Code block. chrony daemon check command

Confirmation result

```
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^+ 198.19.1.53                   4   10   377   1040    -16us[  -37us] +/- 9982us
^* 198.19.1.54                   4   10   377    312   -367us[ -388us] +/-   13ms
```
Code block. chrony daemon check result
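The ^* check can be scripted so that a node without a selected NTP source fails loudly. A minimal sketch reading `chronyc sources` output from stdin (`check_chrony_sync` is an illustrative name):

```shell
# Sketch: confirm that `chronyc sources` output (piped on stdin) contains a
# line starting with "^*", i.e. a selected, synchronized NTP source.
check_chrony_sync() {
    if grep -q '^\^\*'; then
        echo "time synchronized (^* source present)"
    else
        echo "no synchronized NTP source (^*) found"
        return 1
    fi
}
```

Usage: `chronyc sources | check_chrony_sync`.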
GPU MIG/ECC Setting Initialization Check Guide
When a Multi-node GPU Cluster is provisioned, the GPU MIG/ECC settings are initialized. However, to make sure the exact values are applied, reboot the server once after the initial deployment, and then verify the settings according to the inspection guide below before use.
- MIG: Multi-Instance GPU
- ECC: Error Correction Code
MIG Setup Initialization
Refer to the following for how to check and initialize MIG settings.
Use the following command to check if the status value of MIG M is Disabled.
Command

```
root@bm-dev-001:~# nvidia-smi
```
Code block. MIG setting check command

Confirmation result

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4               |
|----------------------------------+-----------------------------+------------------------|
| GPU  Name        Persistence-M   | Bus-Id        Disp.A        | Volatile Uncorr. ECC   |
| Fan  Temp  Perf  Pwr:Usage/Cap   |         Memory-Usage        | GPU-Util  Compute M.   |
|                                  |                             |               MIG M.   |
|==================================+=============================+========================|
|   0  NVIDIA A100-SXM...    Off   | 00000000:03:00.0       Off  |                  Off   |
| N/A   29C    P0    57W / 400W    |      0MiB / 81251MiB        |      0%      Default   |
|                                  |                             |             Disabled   |
+----------------------------------+-----------------------------+------------------------+
|   0  NVIDIA A100-SXM...    Off   | 00000000:0C:00.0       Off  |                  Off   |
| N/A   30C    P0    58W / 400W    |      0MiB / 81251MiB        |     18%      Default   |
|                                  |                             |             Disabled   |
+-----------------------------------------------------------------------------------------+
```
Code block. MIG initialization setting check result

If the MIG M. status value is not Disabled, use the following commands to initialize MIG.

```
root@bm-dev-001:~# nvidia-smi -mig 0
root@bm-dev-001:~# nvidia-smi --gpu-reset
```
Code block. MIG M. status value initialization
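On a server with 8 GPUs, the MIG check above (and the ECC check that follows) can be done in one pass with `nvidia-smi --query-gpu` instead of reading the full table. This is a sketch: it assumes the `mig.mode.current` and `ecc.mode.current` query fields are available for your driver version, and `check_gpu_modes` is an illustrative name. Note that in query output a disabled ECC mode is reported as "Disabled" (shown as "Off" in the table view).

```shell
# Sketch: check the current MIG and ECC modes for every GPU from the CSV
# output of nvidia-smi (piped on stdin), one "mig, ecc" pair per line.
check_gpu_modes() {
    awk -F', *' '
        {
            if ($1 != "Disabled") { print "GPU " (NR-1) ": MIG mode is " $1; bad++ }
            if ($2 != "Disabled") { print "GPU " (NR-1) ": ECC mode is " $2; bad++ }
        }
        END { if (bad) exit 1; print "MIG and ECC disabled on all GPUs" }
    '
}
```

Usage: `nvidia-smi --query-gpu=mig.mode.current,ecc.mode.current --format=csv,noheader | check_gpu_modes`.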
ECC Setting Initialization
Refer to the following for how to check and initialize the ECC settings.
Use the following command to check if the status value of Volatile Uncorr. ECC is Off.
Command

```
root@bm-dev-001:~# nvidia-smi
```
Code block. ECC setting check command

Confirmation result

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4               |
|----------------------------------+-----------------------------+------------------------|
| GPU  Name        Persistence-M   | Bus-Id        Disp.A        | Volatile Uncorr. ECC   |
| Fan  Temp  Perf  Pwr:Usage/Cap   |         Memory-Usage        | GPU-Util  Compute M.   |
|                                  |                             |               MIG M.   |
|==================================+=============================+========================|
|   0  NVIDIA A100-SXM...    Off   | 00000000:03:00.0       Off  |                  Off   |
| N/A   29C    P0    57W / 400W    |      0MiB / 81251MiB        |      0%      Default   |
|                                  |                             |             Disabled   |
+----------------------------------+-----------------------------+------------------------+
|   0  NVIDIA A100-SXM...    Off   | 00000000:0C:00.0       Off  |                  Off   |
| N/A   30C    P0    61W / 400W    |      0MiB / 81251MiB        |     18%      Default   |
|                                  |                             |             Disabled   |
+-----------------------------------------------------------------------------------------+
```
Code block. ECC setting check result

If the Volatile Uncorr. ECC status value is On*, reboot the server.
If the Volatile Uncorr. ECC status value is neither On* nor Off, use the following command to initialize ECC. After initialization, reboot and check that the status value is Off.

```
root@bm-dev-001:~# nvidia-smi --ecc-config=0
```
Code block. ECC setting initialization

3 - Release Note
Multi-node GPU Cluster
- You can now cancel multiple resources at the same time from the GPU Node list.
  - The nodes must use the same DataSet and Cluster Fabric.
- Cloud Monitoring integration has been added.
  - You can check major performance metrics in real time in Cloud Monitoring.
- The Multi-node GPU Cluster service has been launched.
  - It provides physical GPU servers without virtualization for large-scale, high-performance AI computing.