The page has been translated by Gen AI.

How-to guides

Users can create the service by entering the required information for the Multi-node GPU Cluster service and selecting detailed options through the Samsung Cloud Platform Console.

Getting Started with Multi-node GPU Cluster

You can create and use a Multi-node GPU Cluster service in the Samsung Cloud Platform Console.

This service consists of a GPU Node and a Cluster Fabric service.

Create GPU Node

To create a Multi-node GPU Cluster, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the Create GPU Node button. You will be taken to the Create GPU Node page.
  3. On the Create GPU Node page, enter the information required to create the service and select detailed options.
    • Select the required information in the Image and Version Selection area.
      Category | Required | Detailed description
      Image | Required | Select the type of image provided
        • Ubuntu
      Image version | Required | Select the version of the chosen image
        • A list of available versions of the provided server image is shown
      Table. GPU Node image and version selection options
    • In the Service Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Number of servers | Required | Number of GPU Node servers to create simultaneously
        • Only numeric input is allowed, and the minimum number of servers to create is 2.
        • Two or more servers can be created together only at initial creation; later expansion is done one server at a time.
      Service Type > Server Type | Required | GPU Node server type
        • Select the desired CPU, Memory, GPU, and Disk specifications
      Service Type > Planned Compute | Required | Status of resources with Planned Compute configured
        • In Use: Number of resources with Planned Compute that are currently in use
        • Configured: Number of resources with Planned Compute configured
        • Coverage Preview: Amount applied per resource by Planned Compute
        • Apply for Planned Compute Service: Navigate to the Planned Compute service application page
      Table. GPU Node service information input items
    • In the Required Information Input area, enter or select the necessary information.
      Category | Required | Detailed description
      Administrator account | Required | Set the administrator account and password to be used when connecting to the server
        • For the Ubuntu OS, the account is fixed to root
      Server name Prefix | Required | Enter a prefix to distinguish each GPU Node created when the selected number of servers is 2 or more
        • Server names are generated automatically in the form of the user input value (prefix) + '-###'
        • Must start with a lowercase English letter and use only lowercase letters, numbers, and the special character (-), within 3 to 11 characters
        • Must not end with the special character (-)
      Network Settings | Required | Set the network where the GPU Node will be installed
        • VPC name: select a pre-created VPC
        • General Subnet name: select a pre-created general Subnet
          • The IP can be auto-generated or entered manually; if manual input is chosen, the user enters the IP directly
        • NAT: available only when the number of servers is 1 and the VPC is connected to an Internet Gateway. Select the checkbox to use NAT and choose a NAT IP. Because initial creation requires two or more servers, configure NAT afterward on the resource details page.
        • NAT IP: select a NAT IP
          • If no NAT IP is available, click the Create new button to create a Public IP
          • Click the Refresh button to view and select the created Public IP
          • Creating a Public IP incurs charges according to the Public IP pricing
      Table. GPU Node required information input items
    • In the Cluster Selection area, create or select a Cluster Fabric.
      Category | Required | Detailed description
      Cluster Fabric | Required | Configure a group of GPU Node servers that can use GPU Direct RDMA together
        • Optimal GPU performance and speed can be achieved only within the same Cluster Fabric
        • To create a new Cluster Fabric, select New Input, select a Node pool, and then enter the name of the Cluster Fabric to create
        • To add to an existing Cluster Fabric, select Existing Input, select a Node pool, and then select the previously created Cluster Fabric
      Table. GPU Node Cluster Fabric selection items
    • In the Additional Information Input area, enter or select the required information.
      Category | Required | Detailed description
      Lock | Optional | Using Lock prevents accidental actions that could terminate, start, or stop the server.
      Init Script | Optional | Script to run when the server starts
        • The Init Script type must be selected according to the image type
          • For Linux: choose Shell Script or cloud-init
      Tag | Optional | Add tags
        • Up to 50 tags can be added per resource
        • Click the Add Tag button, then enter or select the Key and Value
      Table. GPU Node additional information input items
  4. In the Summary panel, review the generated details and the estimated billing amount, then click the Create button.
    • When creation is complete, check the created resources on the GPU Node List page.
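The naming and server-count rules from the tables above can be checked before you fill in the form. The following is a minimal sketch, not part of the Console: the function names are hypothetical, and both the three-digit zero-padded suffix and the starting number 1 are assumptions about how '-###' is rendered.

```python
import re

# Rules from the Server name Prefix row: starts with a lowercase letter,
# uses only lowercase letters, digits, or '-', is 3 to 11 characters long,
# and does not end with '-'.
PREFIX_PATTERN = re.compile(r"[a-z][a-z0-9-]{1,9}[a-z0-9]")

# Rule from the Number of servers row: at least 2 at initial creation.
MIN_INITIAL_SERVERS = 2


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix satisfies the documented naming rules."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


def preview_server_names(prefix: str, count: int, start: int = 1) -> list[str]:
    """Preview names in the form prefix + '-###'.

    Assumption: '###' is a zero-padded three-digit number beginning at
    `start`; the Console may number servers differently.
    """
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid server name prefix: {prefix!r}")
    if count < MIN_INITIAL_SERVERS:
        raise ValueError("initial creation requires at least 2 servers")
    return [f"{prefix}-{number:03d}" for number in range(start, start + count)]
```

For example, `is_valid_prefix("gpu-")` is rejected because the prefix ends with a hyphen, while `preview_server_names("gpunode", 2)` yields two names under the stated numbering assumption.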
Caution
  • The GPU MIG/ECC settings are reset when the service is created. To make sure the correct settings are applied, reboot the server once after creation and verify that the settings have taken effect before you proceed.
  • For detailed information on resetting GPU MIG/ECC settings, refer to the GPU MIG/ECC Reset Settings Checklist Guide.
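One way to verify the settings after the initial reboot is to compare the current and pending MIG/ECC modes that `nvidia-smi` reports on the GPU Node. The sketch below is an illustration only and does not replace the checklist guide; the function names are hypothetical, and it assumes `nvidia-smi` is installed on the node and that the GPU supports the queried fields.

```python
import csv
import io
import subprocess

# nvidia-smi query fields for the current and pending MIG and ECC modes.
QUERY_FIELDS = (
    "index,mig.mode.current,mig.mode.pending,"
    "ecc.mode.current,ecc.mode.pending"
)


def read_gpu_modes() -> str:
    """Run nvidia-smi on the GPU Node and return its CSV output."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def gpus_with_pending_changes(csv_text: str) -> list[str]:
    """Return indices of GPUs whose current mode differs from the pending mode."""
    indices = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or unexpected lines
        index, mig_now, mig_next, ecc_now, ecc_next = (cell.strip() for cell in row)
        if mig_now != mig_next or ecc_now != ecc_next:
            indices.append(index)
    return indices
```

On the node, `gpus_with_pending_changes(read_gpu_modes())` returning an empty list suggests that no MIG or ECC change is still waiting for a reboot.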

Check GPU Node details

The Multi-node GPU Cluster service allows you to view and edit the full resource list and detailed information of GPU nodes.

The GPU Node Details page consists of the Details, Tags, and Job History tabs.

To view detailed information about the GPU Node, follow these steps.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.

  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.

    • Resource items beyond the required columns can be added via the Settings button.
      Category | Required | Detailed description
      Resource ID | Optional | ID of the GPU Node created by the user
      Cluster Fabric name | Required | Name of the Cluster Fabric created by the user
      Server name | Required | Name of the GPU Node created by the user
      Server type | Required | Server type of the GPU Node
        • Users can view the number of cores, memory capacity, and GPU type and quantity of the resources they created
      Image | Required | Image version of the GPU Node created by the user
      IP | Required | IP of the GPU Node created by the user
      Status | Required | Status of the GPU Node created by the user
      Creation date and time | Optional | Date and time the GPU Node was created
      Table. GPU Node resource list items
  3. On the GPU Node List page, click the resource for which you want to view detailed information. You will be taken to the GPU Node Details page.

    • At the top of the GPU Node Details page, status information and descriptions of additional features are displayed.
      Category | Detailed description
      GPU Node status | Status of the GPU Node created by the user
        • Creating: the server is being created
        • Running: creation is complete and the server is available
        • Editing: the IP is being changed
        • Unknown: error state
        • Starting: the server is starting
        • Stopping: the server is stopping
        • Stopped: the server is stopped
        • Terminating: the server is being terminated
        • Terminated: termination is complete
      Server control | Buttons to change the server status
        • Start: Start a stopped server
        • Stop: Stop a running server
      Service termination | Button to terminate the service
      Table. GPU Node status information and additional features
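The relationship between the statuses and the server control buttons in the table above can be summarized as a small lookup. This is a minimal sketch with hypothetical names; the table only states that Start applies to a stopped server and Stop to a running one, so treating every other status as having no control action is an assumption.

```python
from enum import Enum
from typing import Optional


class NodeStatus(Enum):
    """GPU Node statuses as listed in the status table."""

    CREATING = "Creating"
    RUNNING = "Running"
    EDITING = "Editing"
    UNKNOWN = "Unknown"
    STARTING = "Starting"
    STOPPING = "Stopping"
    STOPPED = "Stopped"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


# Documented: Start applies to a stopped server, Stop to a running server.
CONTROL_BY_STATUS = {
    NodeStatus.STOPPED: "Start",
    NodeStatus.RUNNING: "Stop",
}


def available_control(status: NodeStatus) -> Optional[str]:
    """Return the control action for a status, or None.

    Assumption: transitional and error statuses offer no control action.
    """
    return CONTROL_BY_STATUS.get(status)
```

For instance, `available_control(NodeStatus("Stopped"))` returns the Start action, while a node that is still in the Creating status returns `None` under the stated assumption.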

Detailed Information

On the Details tab of the GPU Node Details page, you can view the detailed information of the selected resource and edit the information if necessary.

Category | Detailed description
Service | Service name
Resource Type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
  • In a GPU Node, it refers to the GPU Node SRN
Resource name | Resource name
  • In the GPU Node service, it refers to the GPU Node name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Editor | User who edited the service information
Modification date and time | Date and time the service information was modified
Server name | Server name
Node pool | A collection of nodes that can be grouped into the same Cluster Fabric
Cluster Fabric name | Name of the Cluster Fabric created by the user
Image/Version | Server OS image and version
Server type | Displays CPU, memory, and GPU information
Planned Compute | Status of resources with Planned Compute configured
Lock | Displays whether Lock is enabled or disabled
  • When Lock is enabled, server termination, start, and stop operations are blocked, which prevents accidental actions.
  • To change the Lock setting, click the Edit button.
Network | GPU Node network information
  • VPC name, general Subnet name, IP, IP status, NAT IP, NAT IP status
Block Storage | Information on the Block Storage attached to the server
  • Volume name, disk type, capacity, status
Init Script | View the Init Script content entered when the server was created
Table. GPU Node Details tab items

Tags

On the Tags tab of the GPU Node Details page, you can view the tag information of the selected resource, and add, modify, or delete tags.

Category | Detailed description
Tag list | List of tags
  • You can view the Key and Value information of each tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. GPU Node Tags tab items

Job History

On the Job History tab of the GPU Node Details page, you can view the job history of the selected resource.

Category | Detailed description
Job history list | Resource change history
  • Check the operation details, operation timestamp, resource type, resource name, event topic, operation result, and operator information
  • Click the Detailed Search button to search with detailed conditions
Table. GPU Node Job History tab items

Control GPU Node Operation

To control and manage the GPU Node resources you created, use the GPU Node List page or the GPU Node Details page. You can start a stopped GPU Node and stop a running one.

Starting GPU Node

You can start a GPU Node that is stopped (Stopped). To start a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
    • On the GPU Node List page, you can select one or more servers with the checkboxes and start them with Start under the More button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Start button at the top to start the server.
  4. Check the server status and complete the status change.

Stopping GPU Node

You can stop a GPU Node that is running (Running). To stop a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. Go to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the GPU Node menu. You will be taken to the GPU Node List page.
    • On the GPU Node List page, you can select one or more servers with the checkboxes and stop them with the Stop button at the top.
  3. On the GPU Node List page, click the resource. You will be taken to the GPU Node Details page.
    • On the GPU Node Details page, click the Stop button at the top to stop the server.
  4. Check the server status and complete the status change.

Terminate GPU Node

You can terminate unused GPU Nodes to reduce operating costs. However, terminating a service may cause the running service to stop immediately, so you should carefully consider the impact of service interruption before proceeding with termination.

Caution
Please note that data cannot be recovered after terminating the service.

To terminate a GPU Node, follow the steps below.

  1. Click the All Services > Compute > Multi-node GPU Cluster menu. You will be taken to the Service Home page of Multi-node GPU Cluster.
  2. On the Service Home page, click the Cluster Fabric menu. You will be taken to the Cluster Fabric List page.
  3. On the Cluster Fabric List page, select the resources to terminate, and click the Service Termination button.
    • Resources that use the same Cluster Fabric can be terminated simultaneously.
  4. When termination is complete, check on the GPU Node List page whether the resources have been terminated.
Information

The cases where a GPU Node cannot be terminated are as follows.

  • If Block Storage (BM) is connected: Disconnect the Block Storage (BM) first.
  • If File Storage is connected: Disconnect the File Storage first.
  • If Lock is enabled: Change the Lock setting to disabled and try again.
  • If the selection includes a server that cannot be terminated simultaneously: Re-select only resources that can be terminated.
  • If the servers you want to terminate belong to different Cluster Fabrics: Select only resources that use the same Cluster Fabric.
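The conditions above can be turned into a pre-termination checklist. The sketch below is illustrative only: `NodeSnapshot` and `termination_blockers` are hypothetical names, and the fields are values you would read manually from the GPU Node Details page rather than from any documented API.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    """State of one GPU Node, copied by hand from its Details page."""

    name: str
    cluster_fabric: str
    lock_enabled: bool = False
    block_storage_attached: bool = False  # Block Storage (BM)
    file_storage_attached: bool = False


def termination_blockers(nodes: list[NodeSnapshot]) -> list[str]:
    """List the reasons the selected nodes cannot be terminated together."""
    blockers = []
    for node in nodes:
        if node.block_storage_attached:
            blockers.append(f"{node.name}: disconnect Block Storage (BM) first")
        if node.file_storage_attached:
            blockers.append(f"{node.name}: disconnect File Storage first")
        if node.lock_enabled:
            blockers.append(f"{node.name}: disable Lock and try again")
    # Only resources in the same Cluster Fabric can be terminated together.
    if len({node.cluster_fabric for node in nodes}) > 1:
        blockers.append("selection spans more than one Cluster Fabric")
    return blockers
```

An empty result means none of the listed conditions applies to the selection; any returned entry names the action to take before retrying.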
Reference
If all GPU Nodes in the Cluster Fabric are deleted, the Cluster Fabric is automatically deleted.