The page has been translated by Gen AI.

How-to guides

The user can enter the required information for the Multi-node GPU Cluster service through the Samsung Cloud Platform Console, select detailed options, and create the service.

Getting Started with Multi-node GPU Cluster

You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.

This service consists of GPU Node and Cluster Fabric services.

GPU Node Creation

To create a Multi-node GPU Cluster, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node creation button on the Service Home page. You will be taken to the GPU Node creation page.
On the GPU Node creation page, enter the information required to create the service and select detailed options.
- In the Image and Version Selection area, select the required information.
  Table. GPU Node image and version selection items

Category	Required	Detailed description
Image	Required	Select the type of image provided: Ubuntu.
Image Version	Required	Select the version of the chosen image from the list of versions available for the provided server images.

In the Enter service information area, enter or select the required information.

Category	Required	Detailed description
Number of servers	Required	Number of GPU Node servers to create at the same time. Only numbers can be entered, and the minimum is 2. Two or more servers can be created together only at initial creation; after that, the cluster can be expanded one server at a time.
Service Type > Server Type	Required	GPU Node server type. Select the desired CPU, memory, GPU, and disk specifications. For details about the server types provided by GPU Node, refer to Multi-node GPU Cluster Server Type.
Service Type > Planned Compute	Required	Status of resources with Planned Compute set. In Use: number of resources with Planned Compute set that are currently in use. Configured: number of resources with Planned Compute set. Coverage Preview: amount applied per resource by Planned Compute. Planned Compute Service Application: go to the Planned Compute service application page. For details, refer to Apply for Planned Compute.

Table. GPU Node Service Information Input Items

In the Required Information Input area, enter or select the required information.

Category	Required or not	Detailed description
Administrator Account	Required	Set the administrator account and password used to connect to the server. For Ubuntu, the account is fixed as root.
Server Name Prefix	Required	Enter a prefix to distinguish the GPU Nodes created when 2 or more servers are selected. Server names are generated automatically in the format user input value (prefix) + ‘`-###`’. The prefix must be 3 to 11 characters long, start with a lowercase English letter, and contain only lowercase letters, numbers, and the hyphen (`-`). It must not end with a hyphen (`-`).
Network Settings	Required	Set the network where the GPU Node will be installed. VPC Name: select a pre-created VPC. General Subnet Name: select a pre-created general Subnet. The IP can be auto-generated or entered by the user; if user input is selected, enter the IP directly. NAT: available only when there is 1 server and the VPC has an Internet Gateway attached. Checking ‘use’ allows selection of a NAT IP. (Because initial creation always involves 2 or more servers, configure NAT afterwards on the resource detail page.) NAT IP: select a NAT IP. If there is no NAT IP to select, click the Create New button to create a Public IP, then click the Refresh button to view and select the created Public IP. Creating a Public IP incurs charges according to the Public IP pricing policy.

Table. GPU Node required information entry items
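
The naming and NAT rules in the table above can be checked before filling in the form. The following is a minimal Python sketch, not part of the Console or any official SDK: the function names are hypothetical, and numbering the generated names from 001 is an assumption, since the guide only states the ‘`-###`’ format.

```python
import re

# Rules taken from the Server Name Prefix row: 3 to 11 characters, starts with
# a lowercase letter, only lowercase letters, digits and '-', no trailing '-'.
PREFIX_PATTERN = re.compile(r"[a-z][a-z0-9-]{1,9}[a-z0-9]")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the documented naming rules."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


def preview_server_names(prefix: str, count: int) -> list[str]:
    """Preview names in the '<prefix>-###' format.

    Assumption: the numeric suffix starts at 001; the guide does not say so.
    """
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid server name prefix: {prefix!r}")
    if count < 2:
        raise ValueError("initial creation requires at least 2 servers")
    return [f"{prefix}-{i:03d}" for i in range(1, count + 1)]


def nat_available(server_count: int, vpc_has_internet_gateway: bool) -> bool:
    """NAT can be used only with 1 server and an Internet Gateway on the VPC."""
    return server_count == 1 and vpc_has_internet_gateway


if __name__ == "__main__":
    print(preview_server_names("gpu-node", 3))
    print(nat_available(server_count=2, vpc_has_internet_gateway=True))
```

Because initial creation always involves at least 2 servers, `nat_available` returns False for any initial request, which matches the note that NAT is configured later on the resource detail page.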

In the Cluster selection area, create or select a Cluster Fabric.

Category	Required	Detailed description
Cluster Fabric	Required	A group of GPU Node servers to which GPU Direct RDMA can be applied together. Optimal GPU performance and speed can be secured only within the same Cluster Fabric. To create a new Cluster Fabric, select New Input > Node pool, then enter the name of the Cluster Fabric to create. To add to an existing Cluster Fabric, select Existing Input > Node pool, then select the already created Cluster Fabric.

Table. GPU Node Cluster Fabric selection items

In the Additional Information Input area, enter or select the required information.

Category	Required or not	Detailed description
Lock	Optional	Using Lock prevents the server from being terminated, started, or stopped by accident.
Init Script	Optional	Script to run when the server starts. The Init Script type must be selected according to the image type. For Linux: select Shell Script or cloud-init.
Tag	Optional	Add tags. Up to 50 tags can be added per resource. Click the Add Tag button, then enter or select the Key and Value.

Table. GPU Node additional information input items

In the Summary panel, check the detailed information and estimated billing amount, then click the Complete button.
- Once creation is complete, check the created resources on the GPU Node List page.

Caution

The GPU MIG/ECC settings are reset when the service is created. For the settings to be applied correctly, reboot the server once after creation and verify that the settings have been applied before using it.
For details about the GPU MIG/ECC settings reset, refer to the GPU MIG/ECC Settings Reset Checklist Guide.
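
One way to verify the settings after the reboot is to compare the current and pending MIG/ECC modes that `nvidia-smi` reports on the node. The sketch below is an illustration only and does not replace the checklist guide: it assumes `nvidia-smi` is installed on the GPU Node, and the helper names are hypothetical. A GPU whose current mode differs from its pending mode still has a change waiting to be applied.

```python
import csv
import io
import subprocess

# Standard nvidia-smi --query-gpu fields for MIG and ECC modes.
QUERY_FIELDS = (
    "index,name,mig.mode.current,mig.mode.pending,"
    "ecc.mode.current,ecc.mode.pending"
)


def parse_gpu_modes(csv_text: str) -> list[dict]:
    """Parse 'nvidia-smi --format=csv,noheader' output into one dict per GPU."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        index, name, mig_cur, mig_pend, ecc_cur, ecc_pend = (f.strip() for f in fields)
        rows.append({
            "index": index,
            "name": name,
            "mig_current": mig_cur,
            "mig_pending": mig_pend,
            "ecc_current": ecc_cur,
            "ecc_pending": ecc_pend,
        })
    return rows


def pending_changes(rows: list[dict]) -> list[dict]:
    """Return GPUs whose current MIG or ECC mode differs from the pending mode."""
    return [
        r for r in rows
        if r["mig_current"] != r["mig_pending"] or r["ecc_current"] != r["ecc_pending"]
    ]


def query_gpu_modes() -> list[dict]:
    """Run nvidia-smi on the node and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_modes(result.stdout)


if __name__ == "__main__":
    for gpu in pending_changes(query_gpu_modes()):
        print(f"GPU {gpu['index']} ({gpu['name']}) has a pending MIG/ECC change")
```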

Check GPU Node Detailed Information

In the Multi-node GPU Cluster service, you can view the full list of GPU Node resources and view or edit their detailed information.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information of the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.

Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.

Columns other than the required ones can be added via the Settings button.

Category	Required or not	Detailed description
Resource ID	Optional	ID of the GPU Node created by the user
Cluster Fabric name	Required	Name of the Cluster Fabric created by the user
Server name	Required	Name of the GPU Node created by the user
Server Type	Required	Server type of the GPU Node. You can check the number of cores, memory capacity, and GPU type and count of the created resource.
Image	Required	Image version of the GPU Node created by the user
IP	Required	IP of the GPU Node created by the user
Status	Required	Status of the GPU Node created by the user
Creation Time	Optional	Time when the GPU Node was created

Table. GPU Node Resource List Items

Click a resource on the GPU Node List page to view its detailed information. You will be taken to the GPU Node Details page.

At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.

Category	Detailed description
GPU Node status	Status of the GPU Node created by the user. Creating: the server is being created. Running: creation is complete and the server is usable. Editing: the IP is being changed. Unknown: error state. Starting: the server is starting. Stopping: the server is stopping. Stopped: the server has stopped. Terminating: termination is in progress. Terminated: termination is complete.
Server Control	Buttons to change the server status. Start: start a stopped server. Stop: stop a running server.
Service cancellation	Button to cancel the service

Table. GPU Node status information and additional features
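
The statuses and Server Control buttons in the table above can be modeled as a small lookup, which is useful when scripting checks around the Console workflow. This is a sketch under assumptions: the table only states that Start applies to a stopped server and Stop to a running one, so treating every other status as accepting no control action is an assumption, and the names `NodeStatus` and `allowed_actions` are hypothetical.

```python
from enum import Enum


class NodeStatus(Enum):
    """GPU Node statuses as listed in the status table."""
    CREATING = "Creating"
    RUNNING = "Running"
    EDITING = "Editing"
    UNKNOWN = "Unknown"
    STARTING = "Starting"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


# Documented: Start applies to a stopped server, Stop to a running server.
# Assumption: no Server Control action is available in any other status.
_ALLOWED = {
    NodeStatus.STOPPED: {"Start"},
    NodeStatus.RUNNING: {"Stop"},
}


def allowed_actions(status: NodeStatus) -> set[str]:
    """Return the Server Control actions that make sense for a status."""
    return set(_ALLOWED.get(status, ()))


if __name__ == "__main__":
    for status in NodeStatus:
        print(f"{status.value}: {sorted(allowed_actions(status))}")
```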

Detailed Information

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit it if necessary.

Category	Detailed description
Service	Service Name
Resource Type	Resource Type
SRN	Unique resource ID in Samsung Cloud Platform. For GPU Node, this is the GPU Node SRN.
Resource Name	Resource name. For the GPU Node service, this is the GPU Node name.
Resource ID	Unique resource ID in the service
Creator	User who created the service
Creation time	Service creation time
Modifier	User who edited the service information
Modification date/time	Date and time when the service information was modified
Server name	Server name
Node pool	A group of nodes that can be grouped together with the same Cluster Fabric
Cluster Fabric name	Cluster Fabric name created by the user
Image/Version	Server’s OS image and version
Server Type	Displays CPU, memory, and GPU information
Planned Compute	Status of resources with Planned Compute set. For details, refer to Apply for Planned Compute.
Lock	Displays whether Lock is in use. When Lock is used, server termination, start, and stop are blocked to prevent accidental actions. To change the Lock setting, click the Edit button.
Network	GPU Node network information: VPC name, general Subnet name, IP, IP status, NAT IP, and NAT IP status
Block Storage	Information about the Block Storage connected to the server: volume name, disk type, capacity, and status
Init Script	View the Init Script content entered when creating the server

Table. GPU Node detailed information tab items

Category	Detailed description
Tag List	List of tags. You can check the Key and Value of each tag. Up to 50 tags can be added per resource. When entering a tag, you can search and select from the existing list of Keys and Values.

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category	Detailed description
Job History List	Resource change history. You can check the job details, job date and time, resource type, resource name, event topic, job result, and operator information. The Detailed Search button provides a detailed search function.

Table. GPU Node Job History Tab Detailed Information Items

GPU Node Operation Control

If you need to control and manage the created GPU Node resources, you can do so on the GPU Node List or GPU Node Details page. You can start stopped GPU Node resources and stop running ones.

Start GPU Node

You can start a stopped GPU Node. To start the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can also select one or more servers with the checkboxes and start them via the More button at the top.
Click the resource on the GPU Node List page. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Start button at the top to start the server.
Check the server status and confirm that the status change is complete.

Stop GPU Node

You can stop a running GPU Node. To stop the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the GPU Node menu on the Service Home page. You will be taken to the GPU Node List page.
- On the GPU Node List page, you can also select one or more servers with the checkboxes and stop them using the Stop button at the top.
Click the resource on the GPU Node List page. You will be taken to the GPU Node Details page.
- On the GPU Node Details page, click the Stop button at the top to stop the server.
Check the server status and confirm that the status change is complete.
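
The final step of both procedures, checking the status until the change completes, amounts to polling. The sketch below shows that loop in Python under stated assumptions: this guide documents only the Console, so `get_status` is a hypothetical callable that you would supply yourself (for example, a wrapper around whatever status lookup is available to you), and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str,
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the node reports the target status.

    Returns True when the target status is reached and False on timeout.
    Raises RuntimeError if the node reports 'Unknown', the documented error state.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == target:
            return True
        if status == "Unknown":
            raise RuntimeError("GPU Node entered the Unknown (error) state")
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status sequence for a Stop request: Stopping, Stopping, Stopped.
    simulated = iter(["Stopping", "Stopping", "Stopped"])
    reached = wait_for_status(lambda: next(simulated), "Stopped", interval_s=0.0)
    print("reached target status:", reached)
```

After a Start request the target would be `Running`; after a Stop request, `Stopped`.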

Cancel GPU Node

You can cancel unused GPU Nodes to reduce operating costs. However, canceling the service may stop the running service immediately, so carefully consider the impact of the interruption before proceeding.

Caution

Please note that data cannot be recovered after service termination.

To cancel the GPU Node, follow the steps below.

Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
Click the Cluster Fabric menu on the Service Home page. You will be taken to the Cluster Fabric List page.
On the Cluster Fabric List page, select the resource to cancel and click the Cancel Service button.
- Resources using the same Cluster Fabric can be terminated at the same time.
Once the termination is complete, check on the GPU Node List page that the resources have been terminated.

Guide

GPU Node termination is not possible in the following cases.

When Block Storage (BM) is connected: disconnect the Block Storage (BM) first.
When File Storage is connected: disconnect the File Storage first.
When Lock is set: change the Lock setting to unused and try again.
When a server that cannot be terminated at the same time is included: re-select only the resources that can be terminated.
When the servers you want to terminate use different Cluster Fabrics: select only resources that use the same Cluster Fabric.
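
The conditions above can be turned into a pre-check that lists what to resolve before requesting cancellation. This is a minimal sketch, not an official tool: the `NodeInfo` record and its field names are hypothetical, and you would fill them in from what the GPU Node Details page shows for each server.

```python
from dataclasses import dataclass


@dataclass
class NodeInfo:
    """Hypothetical record of the properties that block termination."""
    name: str
    cluster_fabric: str
    lock_enabled: bool = False
    block_storage_attached: bool = False
    file_storage_attached: bool = False


def termination_blockers(nodes: list[NodeInfo]) -> list[str]:
    """Return the reasons the selected nodes cannot be terminated together."""
    reasons = []
    for node in nodes:
        if node.block_storage_attached:
            reasons.append(f"{node.name}: disconnect the Block Storage (BM) first")
        if node.file_storage_attached:
            reasons.append(f"{node.name}: disconnect the File Storage first")
        if node.lock_enabled:
            reasons.append(f"{node.name}: change the Lock setting to unused")
    fabrics = sorted({node.cluster_fabric for node in nodes})
    if len(fabrics) > 1:
        reasons.append(
            "selection spans several Cluster Fabric groups ("
            + ", ".join(fabrics)
            + "): select only resources that use the same Cluster Fabric"
        )
    return reasons


if __name__ == "__main__":
    selection = [
        NodeInfo("gpu-001", "fabric-a"),
        NodeInfo("gpu-002", "fabric-a", lock_enabled=True),
    ]
    for reason in termination_blockers(selection):
        print(reason)
```

An empty result means none of the documented blocking conditions apply to the selection.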

Reference

If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.

Monitoring Metrics

Cluster Fabric Management