AI-ML

We offer AI/ML services that enable easy and convenient development of ML/DL (Machine Learning/Deep Learning) models and the setup of training environments.

1 - CloudML

1.1 - Overview

Service Overview

CloudML is an integrated platform that supports the entire machine learning process—from data analysis to model development, training, validation, and deployment—in a cloud environment.

Features

  • CloudML lets users in different roles, such as analysts, machine learning engineers, and developers, collaborate in a single environment and easily design and operate machine learning workflows.
  • CloudML provides an analysis environment based on Python and R, so users with programming experience can use the platform more flexibly and effectively. In particular, the generative AI-based Copilot feature handles code writing, refactoring, error correction, and function recommendation from natural language input, which increases analytical productivity and accessibility.
  • CloudML systematically supports each stage, including analysis environment configuration, model development and serving, analysis automation, and visualization. Repeated experiments and operational automation improve both productivity and model quality.

Service Architecture Diagram

CloudML consists of an analysis environment, machine learning lifecycle management, automated analysis support, visualization, and a generative AI‑based Copilot feature, allowing users to perform the entire machine‑learning process in an integrated manner.

Architecture diagram
Figure. CloudML architecture diagram

Provided features

CloudML provides the following features.

  • Visual Modeling: Provides an intuitive drag-and-drop interface that lets you build and deploy machine learning models without coding. You can manage the entire process, from data loading to model evaluation and deployment.
  • Code-based Development: In the Jupyter Notebook environment, you can freely write and execute code in Python, R, and other languages. It provides powerful features for advanced users and researchers.
  • Workflow Automation: Automates complex machine learning workflows such as data preprocessing, model training, evaluation, and deployment.
  • Experiment Management: You can train machine learning models with various parameter combinations and systematically manage and compare the results.
  • Copilot: Provides a natural-language AI assistant that guides and automates the model development process. It supports tasks such as code generation, refactoring, error correction, and documentation, which improves productivity.
  • Integrated Platform: All features are integrated within CloudML for convenient use.
  • Scalability and Flexibility: Supports scaling computing resources and connecting various data sources as needed.

Constraints

Before using CloudML, check the following constraints and incorporate them into your service usage plan. Because CloudML runs in a Kubernetes-based environment, appropriate cluster resource configuration is required for stable service operation.

  • Application Basic Resources: To run the application, a minimum of 24 vCPU cores and 96 GiB of memory are allocated by default.
  • Analysis Task Resources: Analysis tasks require additional CPU or GPU resources beyond the basic resources above. Size them according to the workload of the analysis tasks.
  • Copilot (CPU-based usage): To run Copilot on CPU resources, a minimum of 16 vCPU cores and 10 GiB of memory are required. In this case, the CPU resources available for analysis tasks are reduced accordingly.
  • Copilot (GPU-based usage): Copilot can also be configured to use dedicated GPU resources.
  • Supported LLMs: Currently, Llama3 is the only LLM that can be applied to Copilot.
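
The resource figures above can be turned into a quick sizing check. The following is a minimal sketch, not an official sizing tool: it assumes, for illustration only, that the whole cluster is treated as one pool from which the application reservation and (optionally) the CPU-based Copilot reservation are subtracted. The names `Resources` and `analysis_budget` are hypothetical.

```python
from dataclasses import dataclass

# Figures taken from the constraints listed above.
APP_MIN_VCPU = 24
APP_MIN_MEM_GIB = 96
COPILOT_CPU_VCPU = 16
COPILOT_CPU_MEM_GIB = 10


@dataclass(frozen=True)
class Resources:
    vcpu: int
    mem_gib: int


def analysis_budget(cluster: Resources, copilot_on_cpu: bool) -> Resources:
    """Return what remains for analysis tasks after the fixed reservations.

    Assumption: the cluster is modeled as a single pool of vCPU and memory.
    """
    vcpu = cluster.vcpu - APP_MIN_VCPU
    mem = cluster.mem_gib - APP_MIN_MEM_GIB
    if copilot_on_cpu:
        # CPU-based Copilot reduces what is available for analysis tasks.
        vcpu -= COPILOT_CPU_VCPU
        mem -= COPILOT_CPU_MEM_GIB
    if vcpu < 0 or mem < 0:
        raise ValueError("cluster is smaller than the required reservations")
    return Resources(vcpu, mem)


if __name__ == "__main__":
    cluster = Resources(vcpu=40, mem_gib=160)
    print(analysis_budget(cluster, copilot_on_cpu=False))
    print(analysis_budget(cluster, copilot_on_cpu=True))
```

With a 40 vCPU / 160 GiB cluster, enabling CPU-based Copilot in this model leaves no vCPU for analysis tasks, which illustrates why dedicated GPU resources or additional analysis nodes may be worth planning for.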

Provision status by region

CloudML is available in the following environments.

Region | Availability
Korea West (kr-west1) | Available
Korea East (kr-east1) | Available
Korea South 1 (kr-south1) | Not available
Korea South 2 (kr-south2) | Not available
Korea South 3 (kr-south3) | Not available
Table. CloudML regional availability status

Prerequisite Services

The following services must be configured before you create the service. Refer to the guide provided for each service and prepare them in advance.

Service category | Service | Detailed description
Container | Container Registry | A service that stores, manages, and shares container images.
Container | Kubernetes Engine | Kubernetes container orchestration service
Networking | Load Balancer | A service that automatically distributes server traffic load.
Table. CloudML Prerequisite Services

1.2 - How-to guides

Create CloudML

Users can create the service by entering the required CloudML information and selecting detailed options through the Samsung Cloud Platform Console.

To create a CloudML, follow these steps.

  1. Click the All Services > AI/ML > CloudML menu. Navigate to CloudML’s Service Home page.

  2. On the Service Home page, click the Create CloudML button. You will be taken to the CloudML Creation page.

  3. On the CloudML Creation page, enter the information required to create the service and select detailed options.

    • In the Version Selection area, select the version of the service.

      Category | Required | Detailed description
      Select version | Required | Select the CloudML version
      Table. CloudML service version selection options

    • In the SCP Kubernetes Engine Deployment area, select the options needed to create the service.

      Category | Required | Detailed description
      Cluster name | Required | Select a Kubernetes Engine cluster
      Table. CloudML Service Cluster Selection Options

    • In the Service Information Input area, select the options required to create the service.

      Category | Required | Detailed description
      CloudML name | Required | Enter the service name
      Description | Optional | Enter a service description
      Domain name | Required | Enter the domain name to be used for the service
      • Enter 2-63 characters using lowercase English letters, numbers, and special characters
      Endpoint | Required | Select the endpoint to use in the service
      • Choose between Private and Public
      Copilot | Optional | Select whether to use Copilot in the service
      • Selecting Apply requires agreeing to the terms in the popup window
      • If the selected cluster is not configured with GPUs dedicated to LLMs, or the allocated LLM resources are insufficient, Copilot cannot be applied
      Resource Information | Required | Displays the resource information of the selected cluster
      Enter SCR information | Required | Enter the SCR information to be used in the service
      • Enter the private endpoint, authentication key, and secret key
      Table. CloudML service information input items

    • In the Additional Information Input area, enter or select the required information.

      Category | Required | Detailed description
      Tag | Optional | Add tags
      • Up to 50 tags can be added per resource
      • After clicking the Add Tag button, enter or select the Key and Value
      Table. CloudML Additional Information Input Items

  4. In the Summary panel, check the detailed information and the estimated billing amount, and click the Complete button.

    • When creation is complete, check the created resources on the CloudML List page.

Check CloudML detailed information

You can view and edit the full list of resources and detailed information for the CloudML service. The CloudML Details page consists of the Details, Tags, and Activity History tabs.

To view the detailed information of CloudML, follow these steps.

  1. Click the All Services > AI/ML > CloudML menu. Navigate to CloudML’s Service Home page.
  2. On the Service Home page, click the resource (CloudML) to view detailed information. You will be taken to the CloudML Details page.
    • The CloudML Details page displays CloudML’s status information and detailed information, and consists of the Details, Tags, and Activity History tabs.
      Category | Detailed description
      Service status | CloudML status
      • Creating: Being created
      • Deployed: Created and operating normally
      • Updating: Settings being updated
      • Terminating: Being terminated
      • Error: An error occurred
      Connection Guide | Service access guide
      • Information on the host to register on the user’s PC
      Service termination | Cancel Service button
      Table. CloudML status information and additional features

Detailed Information

The CloudML List page lets you view detailed information of the selected resource and modify it if necessary.

Category | Detailed description
Service | Service name
Resource type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
Resource name | Resource name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Modifier | User who edited the service information
Modification date and time | Date and time the service information was modified
Product name | CloudML name
Copilot | Whether Copilot is used
Description | Description of the service
Cluster name | Name of the selected Kubernetes Engine cluster
Domain name | Service domain name that was entered
Version | Selected service version
Installation node information | Information on the nodes installed on the cluster
SCR information | SCR information that was entered
Table. CloudML detailed information items

Tags

On the CloudML List page, you can view the tag information of the selected resource, and add, modify, or delete it.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. CloudML Tag Tab Items
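
The 50-tag limit per resource can be enforced on the client side before submitting tags. The sketch below is illustrative only and assumes that tag Keys are unique per resource, so that re-using an existing Key updates its Value rather than adding a tag; `add_tag` is a hypothetical helper, not part of any Samsung Cloud Platform SDK.

```python
from typing import Dict

MAX_TAGS_PER_RESOURCE = 50  # limit stated in the Tags tab description


def add_tag(tags: Dict[str, str], key: str, value: str) -> Dict[str, str]:
    """Return a new tag mapping with key=value applied, enforcing the limit.

    Assumption: Keys are unique, so an existing Key is overwritten.
    """
    updated = dict(tags)  # copy so the caller's mapping is left untouched
    updated[key] = value
    if len(updated) > MAX_TAGS_PER_RESOURCE:
        raise ValueError(
            f"a resource can have at most {MAX_TAGS_PER_RESOURCE} tags"
        )
    return updated


if __name__ == "__main__":
    tags = add_tag({}, "team", "ml-platform")
    tags = add_tag(tags, "env", "dev")
    print(tags)
```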

Activity History

On the CloudML List page, you can view the activity history of the selected resource.

Category | Detailed description
Activity history list | Resource change history
  • You can view the operation date and time, resource type, resource name, operation details, operation result, operator name, and path information
  • To perform an advanced search, click the Advanced Search button
Table. Activity History Tab Detailed Information Items

Terminate CloudML Service

Users can cancel the CloudML service through the Samsung Cloud Platform Console.

Reference
If the CloudML service status is Creating, Updating, or Terminating, the service cannot be terminated.

To cancel CloudML, follow the steps below.

  1. Click the All Services > AI/ML > CloudML menu. Navigate to CloudML’s Service Home page.
  2. Click the Cancel Service button on the Service Home page. A service cancellation alert window appears.
  3. Enter the CloudML name to delete in the dialog and click the Confirm button.
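
The termination rule in the note above can be expressed as a small status check. This is a sketch under assumptions: the status names come from the status table in this guide, while the `ServiceStatus` enum and `can_terminate` function are hypothetical. Treating Error as terminable is inferred from the note listing only three blocking states.

```python
from enum import Enum


class ServiceStatus(Enum):
    CREATING = "Creating"
    DEPLOYED = "Deployed"
    UPDATING = "Updating"
    TERMINATING = "Terminating"
    ERROR = "Error"


# States in which the guide says the service cannot be terminated.
_BLOCKING = {
    ServiceStatus.CREATING,
    ServiceStatus.UPDATING,
    ServiceStatus.TERMINATING,
}


def can_terminate(status: ServiceStatus) -> bool:
    """Return True when termination is allowed for the given status."""
    return status not in _BLOCKING


if __name__ == "__main__":
    for status in ServiceStatus:
        print(status.value, can_terminate(status))
```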

1.2.1 - Kubernetes Cluster Configuration

Configuring a Kubernetes cluster

To apply for the CloudML service, a dedicated cluster for CloudML must be set up. A dedicated cluster means creating a Kubernetes Engine that meets or exceeds the required minimum specifications and configuring several necessary settings. Create a dedicated cluster in advance before applying for the CloudML service.

  • For instructions on creating a cluster, see the Cluster Creation guide.
  • CloudML exposes an HTTPS endpoint on port 443. When creating a cluster, select Public Endpoint.

Recommended specifications for cluster nodes and storage

Cluster nodes can be added or modified after the cluster is created. The following are the recommended specifications for cluster nodes and storage that should be prepared to install CloudML for five users.

Category | Item | Role | Capacity
Cluster node | Kubernetes node pool (Virtual Server) | Application execution (label node.kubernetes.io/nodetype: ml-app) | 24 cores / 96 GiB
Cluster node | Kubernetes node pool (Virtual Server) | Analysis execution (label node.kubernetes.io/nodetype: ml-analytics) | 8 cores / 32 GiB x 2 nodes (total 16 cores / 64 GiB)
Storage | File Storage | Data storage | 1 TB
Table. Recommended specifications for cluster nodes and storage items
Notice

If you need to change specifications such as adjusting the number of nodes, adding GPU nodes, or expanding resources, please request technical support.

Add a label to a node

Add labels to the nodes directly according to the role-specific recommendations in the cluster node and storage specifications.

  • For instructions on adding labels to a node YAML, refer to the Edit Node YAML guide.

To add a label to a cluster node, follow these steps.

  1. Click the All Services > Container > Kubernetes Engine menu. Navigate to the Service Home page of Kubernetes Engine.
  2. On the Service Home page, click the Node menu. You will be taken to the Node List page.
  3. On the Node List page, select the cluster for which you want to view detailed information from the gear button at the top left, then click the Confirm button.
  4. Select the node you want to view details for and click it. You will be taken to the Node Details page.
  5. On the Node Details page, click the YAML tab. You will be taken to the YAML tab page.
  6. On the YAML tab page, click the Edit button. The node edit window opens.
  7. In the node edit window, add a label that matches the role and click the Save button.
    • Check the following information and add a label that matches the node specifications.
      Category | Purpose | Label
      CPU node | Application | node.kubernetes.io/nodetype: ml-app
      CPU node | Analysis | node.kubernetes.io/nodetype: ml-analytics
      GPU node | Analysis | node.kubernetes.io/nodetype: ml-analytics-gpu
      GPU node | Copilot | node.kubernetes.io/nodetype: ml-gpu
      Table. Kubernetes node label items by purpose
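
As an alternative to editing each node's YAML in the console, the same labels could be applied with kubectl, assuming you have kubectl access to the cluster (this guide itself only describes the console procedure). The sketch below only builds the command arguments from the label table above. The node name and the `kubectl_label_command` helper are hypothetical.

```python
from typing import Dict, List, Tuple

NODETYPE_KEY = "node.kubernetes.io/nodetype"

# (hardware, purpose) -> label value, copied from the label table above.
NODETYPE_VALUES: Dict[Tuple[str, str], str] = {
    ("cpu", "app"): "ml-app",
    ("cpu", "analysis"): "ml-analytics",
    ("gpu", "analysis"): "ml-analytics-gpu",
    ("gpu", "copilot"): "ml-gpu",
}


def kubectl_label_command(node_name: str, hardware: str, purpose: str) -> List[str]:
    """Build a `kubectl label node` argument list for the given role.

    Raises KeyError for combinations that the label table does not define.
    """
    value = NODETYPE_VALUES[(hardware, purpose)]
    return ["kubectl", "label", "node", node_name, f"{NODETYPE_KEY}={value}"]


if __name__ == "__main__":
    # "worker-1" is a placeholder node name.
    print(" ".join(kubectl_label_command("worker-1", "cpu", "app")))
```

Returning an argument list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting.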

1.3 - API Reference

API Reference

1.4 - CLI Reference

CLI Reference

1.5 - Release Note

CloudML

2025.07.01
NEW CloudML service official version release
  • We have launched the CloudML service, which supports the entire machine learning process—from data analysis to model development, training, validation, and deployment—in a cloud environment through the Samsung Cloud Platform.

2 - AI&MLOps Platform

2.1 - Overview

Service Overview

AI&MLOps Platform is a machine learning platform that automates repetitive tasks across the entire pipeline of developing, training, and deploying machine learning models. Through the AI&MLOps Platform service, integrated management of training data, models, and operational data is possible on a Kubernetes-based AI/MLOps environment.

The AI&MLOps Platform provides two service types: an Enterprise service that adds add-on features, such as distributed training job execution and monitoring, to the open-source product Kubeflow, and Kubeflow Mini, which enables development, training, tuning, and deployment of machine learning models.

Reference
For AI&MLOps Platform related sites, refer to Kubeflow.

Features

  • Providing a Cloud Native MLOps Environment: The AI&MLOps Platform provides a cloud‑optimized machine learning model development environment, and its Kubernetes‑based architecture makes integration with various open‑source tools convenient.

  • Machine Learning Development and Operations Convenience: Provides a standardized environment that supports various machine learning frameworks such as TensorFlow, PyTorch, scikit-learn, and Keras. By automating the entire pipeline for developing, training, and deploying machine learning models, it makes model composition and creation easy and promotes reusability.

  • Enhanced GPU Integration: Multi-Node GPU on Bare Metal Server and GPUDirect RDMA (Remote Direct Memory Access) can significantly speed up LLM (Large Language Model) and natural language processing (NLP) jobs.

Service Diagram

Diagram
Figure. AI&MLOps Platform Diagram

Provided features

The AI&MLOps Platform provides the following features.

  • ML Model Development Environment and Features

    • Notebook Provision: Creates Jupyter Notebook and VS Code environments that include ML frameworks such as TensorFlow and PyTorch.
    • TensorBoard: Creates and manages TensorBoard servers (TensorBoard is a tool for visualizing and analyzing the ML model training process).
    • Volumes: Stores datasets and models during ML model development. A Volume can be connected when creating a Jupyter Notebook.
  • ML model distributed training Job execution/management

    • Supports execution and monitoring of distributed training jobs, as well as management and analysis of inference services. (Add-on)
    • Provides various features for configuring MLOps environments, such as Job Queue management. (Add-on)
    • Provides features for efficient GPU resource utilization, such as Job Scheduler (FIFO, Bin-packing, Gang-based), GPU Fraction, and GPU resource monitoring. (Add-on)
    • Significantly speeds up LLM (Large Language Model) and natural language processing (NLP) jobs by using BM-based Multi-Node GPU and GPUDirect RDMA (Remote Direct Memory Access). (Add-on)
  • ML Model Experiment Management and Pipeline

    • Provides Experiments (KFP) for managing ML pipeline experiments.
    • Supports pipeline automation features for configuring and executing ML tasks in stages.

Component

Operating System version

The operating systems supported by the AI&MLOps Platform are as follows.

Operating System (OS) | Version
RHEL | RHEL 8.3
Ubuntu | Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04
Table. Supported Operating System Versions

Provision status by region

The AI&MLOps Platform is available in the environments below.

Region | Availability
Korea West (kr-west1) | Available
Korea East (kr-east1) | Available
Korea South 1 (kr-south1) | Not available
Korea South 2 (kr-south2) | Not available
Korea South 3 (kr-south3) | Not available
Table. AI&MLOps Platform regional availability status

Prerequisite Services

The following services must be configured before you create the service. Refer to the guide provided for each service and prepare them in advance.

Service category | Service | Detailed description
Container | Kubernetes Engine | Kubernetes container orchestration service
Table. AI&MLOps Platform Prerequisite Services

2.2 - How-to guides

Create AI&MLOps Platform

Users can create the service by entering the required information for the AI&MLOps Platform and selecting detailed options through the Samsung Cloud Platform Console.

To create an AI&MLOps Platform, follow these steps.

  1. Click the All Services > AI/ML > AI&MLOps Platform menu. You will be taken to the Service Home page of AI&MLOps Platform.
  2. On the Service Home page, click the Create AI&MLOps Platform button. You will be taken to the Create AI&MLOps Platform page.
  3. On the Service Type Selection page of AI&MLOps Platform creation, enter the information required to create the service and select detailed options.
    • Select the service type in the Service Type and Version Selection area.
      Category | Required | Detailed description
      Service type | Required | Select the service type
      • AI&MLOps Platform
      • Kubeflow Mini
      Service type version | Required | Select the version of the selected service
      • A list of available versions of the service is provided
      Table. AI&MLOps Platform service types and version selection items
    • In the Cluster Deployment Area Classification area, select the options required to create the service.
      Category | Required | Detailed description
      Cluster deployment area | Required |
      • Deploy from Kubernetes Engine: Select a previously created Kubernetes Engine
      • Deploy to a new cluster: Create a Kubernetes Engine together with the AI&MLOps Platform
      Table. AI&MLOps Platform Service Cluster Deployment Area Classification Items
      Reference
      The configuration items on the subsequent Service Information Input page vary depending on the cluster deployment setting.
  4. On the Service Information Input page of AI&MLOps Platform Creation, enter the information required to create the service and select detailed options.
  5. On the Creation Information Check page of AI&MLOps Platform creation, review the detailed information you created and the estimated billing amount, and click the Complete button.
    • Once creation is complete, check the created resources on the AI&MLOps Platform Service List page.

Check detailed information of AI&MLOps Platform

The AI&MLOps Platform service allows you to view and edit the full list of resources and detailed information. AI&MLOps Platform Service Details page consists of Details, Tags, Activity History tabs.

To view detailed information about the AI&MLOps Platform service, follow the steps below.

  1. Click the All Services > AI/ML > AI&MLOps Platform Service menu. Navigate to the Service Home page of the AI&MLOps Platform Service.
  2. On the Service Home page, click the AI&MLOps Platform menu. You will be taken to the AI&MLOps Platform Service List page.
  3. On the AI&MLOps Platform Service List page, click the resource to view detailed information. You will be taken to the AI&MLOps Platform Service Details page.
    • AI&MLOps Platform Service Details page displays status information and additional feature information, and consists of Details, Tags, Activity History tabs.

Detailed Information

AI&MLOps Platform Service List page lets you view detailed information of the selected resource and edit the information if needed.

Category | Detailed description
Service | Service name
Resource type | Resource type
SRN | Unique resource ID in Samsung Cloud Platform
Resource name | Resource name
  • In the AI&MLOps Platform service, this refers to the cluster name
Resource ID | Unique resource ID in the service
Creator | User who created the service
Creation date and time | Date and time the service was created
Modifier | User who edited the service information
Modification date and time | Date and time the service information was modified
Dashboard status | Dashboard status value
Service name | Service name
Admin Email Address | Administrator email address
Image name | Service image name
Version | Image version
Service type | Deployed service type
Table. AI&MLOps Platform Service Detailed Information Items

Tags

The AI&MLOps Platform Service List page lets you view the tag information of the selected resource, and add, modify, or delete it.

Category | Detailed description
Tag list | Tag list
  • You can view the Key and Value information of the tag
  • Up to 50 tags can be added per resource
  • When entering a tag, you can search and select from the list of previously created Keys and Values
Table. Cluster Tag Tab Items

Activity History

The AI&MLOps Platform Service List page lets you view the activity history of the selected resource.

Category | Detailed description
Activity history list | Resource change history
  • You can view operation details, operation time, resource type, resource name, operation result, and operator information
  • Click a resource in the activity history list to open the Activity History Details popup window
Table. AI&MLOps Platform Service Activity History Tab Detailed Information Items

Access AI&MLOps Platform

To access the AI&MLOps Platform dashboard, you must complete the prerequisite steps.

Preliminary work

To access the AI&MLOps Platform, you must preconfigure the relevant ports and the IP addresses required for connection in the Security Group and Firewall (if using a firewall).

  • Kubeflow Mini: port 31390 (inbound rules of Security Group, VPC firewall)

  • To access the cluster’s worker node, you must set an inbound rule for port 22 on the Security Group and Firewall (when using a VPC firewall).
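
Before opening the dashboard, it can help to confirm that the inbound rules actually cover the client address. The following is a minimal sketch that checks a locally held list of rules; it does not call any Samsung Cloud Platform API. The `InboundRule` type, the `missing_ports` function, and the example addresses (documentation-reserved ranges) are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List

KUBEFLOW_MINI_PORT = 31390  # dashboard port listed above
SSH_PORT = 22               # only needed to access worker nodes


@dataclass(frozen=True)
class InboundRule:
    port: int
    source_cidr: str


def missing_ports(rules: List[InboundRule], client_ip: str, need_ssh: bool = False) -> List[int]:
    """Return required ports that no rule opens for client_ip."""
    required = [KUBEFLOW_MINI_PORT] + ([SSH_PORT] if need_ssh else [])
    addr = ipaddress.ip_address(client_ip)
    # Collect the ports whose source range contains the client address.
    allowed = {
        rule.port
        for rule in rules
        if addr in ipaddress.ip_network(rule.source_cidr)
    }
    return [port for port in required if port not in allowed]


if __name__ == "__main__":
    rules = [InboundRule(port=31390, source_cidr="203.0.113.0/24")]
    print(missing_ports(rules, "203.0.113.10", need_ssh=True))
```

The same check would need to pass for both the Security Group and the VPC firewall when a firewall is in use.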

Access Dashboard

To access the AI&MLOps Platform service, follow these steps.

  1. Click the All Services > AI/ML > AI&MLOps Platform Service menu. You will be taken to the Service Home page of the AI&MLOps Platform service.
  2. Click the AI&MLOps Platform Service menu on the Service Home page. You will be taken to the AI&MLOps Platform Service List page.
  3. Click the resource to view detailed information on the AI&MLOps Platform Service List page. You will be taken to the AI&MLOps Platform Details page.
  4. On the AI&MLOps Platform Details page, click the Access Guide button. The Access Guide popup window opens.
  5. In the Access Guide popup window, click the dashboard’s URL link. You will be taken to the corresponding dashboard page.
Caution
When using a public subnet and assigning a public IP, you may be exposed to security attacks such as external hacking and malware infection.

Terminate AI&MLOps Platform

You can cancel the unused service to reduce operating costs. However, canceling the service may cause the running service to stop immediately, so you should thoroughly consider the impact of service interruption before proceeding with the cancellation.

Caution
Please note that data cannot be recovered after terminating the service.

To cancel the AI&MLOps Platform, follow the steps below.

  1. Click the All Services > AI/ML > AI&MLOps Platform Service menu. Navigate to the Service Home page of the AI&MLOps Platform Service.
  2. On the Service Home page, click the AI&MLOps Platform Service menu. You will be taken to the AI&MLOps Platform Service List page.
  3. Click the resource to view detailed information on the AI&MLOps Platform Service List page. You will be taken to the AI&MLOps Platform Details page.
  4. On the AI&MLOps Platform Details page, click the Cancel Service button. The Cancel Service popup window opens.
  5. After entering the service name for verification, click Confirm.
  6. When termination is complete, check on the AI&MLOps Platform Service List page whether the resource has been terminated.

2.2.1 - Cluster Deployment

Cluster deployment area

In Samsung Cloud Platform, the service type selection step of AI&MLOps Platform creation provides two cluster deployment options.

Common

Before proceeding with the cluster deployment, be sure to verify the Kubernetes cluster specifications required for installation.

  • Regardless of the selected cluster deployment option, you must verify the Kubernetes cluster specifications in advance.
  • For detailed specification information, refer to the Cluster Specification guide.

The installation details on the Service Information Input page of AI&MLOps Platform creation differ depending on the selected cluster deployment option.

Deploy from SCP Kubernetes Engine

  1. Click the All Services > AI/ML > AI&MLOps Platform menu. You will be taken to the Service Home page of AI&MLOps Platform.
  2. On the Service Home page, click the Create AI&MLOps Platform button. You will be taken to the Create AI&MLOps Platform page.
  3. On the Service Type Selection page of AI&MLOps Platform creation, enter the information required to create the service and select detailed options.
    Cluster deployment
    Select the Deploy from SCP Kubernetes Engine option.
  4. On the Service Information Input page of AI&MLOps Platform Creation, enter the information required to create the service, and select detailed options.
    • In the Service Information Input area, enter or view the information required to create a service.
      Category | Required | Detailed description
      Service name | Required | Enter the AI&MLOps Platform name
      • The AI&MLOps Platform name cannot be duplicated within a project
      Storage Class | Required | The Storage Class is registered automatically
      Installation node information | View only | View the node information of the selected Kubernetes Engine
      Admin Email Address | Required | Enter the administrator (Admin) email address to use for login
      Password | Required | Enter the password to use for login
      Confirm Password | Required | Re-enter the password to prevent input errors
      Table. AI&MLOps Platform service information input fields
    • In the Additional Information Input area, enter or select the information required to create the service.
      Category | Required | Detailed description
      Tag | Optional | Select tags to add to the AI&MLOps Platform
      • Click Add Tag to create a new tag or add an existing tag
      • You can register up to 50 tags
      • Newly added tags are applied after service creation is complete
      Table. AI&MLOps Platform service additional information input fields

Deploy to a new cluster

  1. Click the All Services > AI/ML > AI&MLOps Platform menu. You will be taken to the Service Home page of AI&MLOps Platform.
  2. On the Service Home page, click the Create AI&MLOps Platform button. You will be taken to the Create AI&MLOps Platform page.
  3. On the Service Type Selection page of the AI&MLOps Platform creation, enter the information required to create the service and select detailed options.
    Cluster deployment
    Select the Deploy to a new cluster option.
  4. On the Service Information Input page of AI&MLOps Platform creation, enter the information required to create a service and select detailed options.
    • In the Service Information Input area, enter or view the information required to create a service.

      Category | Required | Detailed description
      Service name | Required | Enter the AI&MLOps Platform name
      • The AI&MLOps Platform name cannot be duplicated within a project
      Storage Class | Required | The Storage Class is registered automatically
      Installation node information | View only | View the node information of the selected Kubernetes Engine
      Admin Email Address | Required | Enter the administrator (Admin) email address to use for login
      Password | Required | Enter the password to use for login
      Confirm Password | Required | Re-enter the password to prevent input errors
      Table. AI&MLOps Platform Service Information Input Items

    • In the Kubernetes Engine Information Input area, enter or select the required information.

      Category | Required | Detailed description
      Cluster name | Required | Cluster name
      • Must start with an English letter and may use English letters, numbers, and the special character (-)
      • Enter 3 to 30 characters
      Control plane settings > Kubernetes version | Required | Select the Kubernetes version
      Control plane settings > Control plane logging | Optional | Select whether to enable control plane logging
      • Audit/Event logs from the cluster control plane can be viewed in Cloud Monitoring’s log analysis
      • Up to 1 GB of log storage for all services within the account is provided for free, and logs exceeding 1 GB are deleted sequentially
      Network Settings | Required | Network connection settings for the node pool
      • VPC: Select a pre-created VPC
      • Subnet: Select a standard Subnet to use from the subnets of the chosen VPC
      • Security Group: Click the Search button, then select a Security Group in the Select Security Group popup
      • Load Balancer: Provides the type: LoadBalancer feature of the Kubernetes Service object
        • Select a load balancer on the same network
        • Select whether to use it
        • Cannot be changed after configuration
      File Storage Settings | Required | Select the file storage volume to use in the cluster
      • Default Volume (NFS): Select File Storage using the Search button
        • The default Volume file storage provides only the NFS format
      Table. Kubernetes Engine Service Information Input Items

    • Enter or select the required information in the Node Pool Information Input area.

      Category
      Required status
      Detailed description
      Node pool configurationRequiredSelect node pool information
      • * Items marked with an asterisk are required fields and must be entered
      • For the AI&MLOps Platform, image size may continuously increase depending on usage, so setting Block Storage to at least 200 GB enables smooth system configuration
      Table. AI&MLOps Platform Service Information Input Items
      Reference
      • A Windows OS node pool can be created only when an additional storage (CIFS) volume is in use in the cluster.
      • Volume encryption for node pool Block Storage can only be set at initial creation.
        • Enabling encryption may cause performance degradation in some features.
      • You can enter the node count, minimum node count, and maximum node count only when the node pool automatic scale-out/scale-in feature is selected.

    • In the Additional Information Input area, enter or select the required information.

      Category | Required | Detailed description
      Tag | Optional | Select tags to add to the AI&MLOps Platform
      • Click ‘Add Tag’ to create a new tag or add an existing tag
      • You can register up to 50 tags
      • The newly added tags will be applied after the service creation is completed
      Table. AI&MLOps Platform Service Information Input Items

Cluster specifications

To use the AI&MLOps Platform, you need a Kubernetes Engine cluster on which to install it. You can select an existing Kubernetes Engine, or create one together with the AI&MLOps Platform.

The specifications of the Kubernetes cluster required for installation are as follows.

  • Node pool resource size (composed of 2 or more nodes)

    • AI&MLOps Platform: vCPU 32, Memory 128G or more
    • Kubeflow Mini: vCPU 24, Memory 96G or more
  • Kubernetes version

    • AI&MLOps Platform v1.9.1 (k8s v1.30)
    • Kubeflow Mini v1.9.1 (k8s v1.30)
Information
Only one AI&MLOps Platform can be installed per Kubernetes cluster, and a cluster that is being used for other purposes cannot have the AI&MLOps Platform installed.
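
The sizing requirement above can be expressed as a quick pre-check. This is a sketch under stated assumptions: it treats the vCPU and memory figures as totals across the node pool and reads "G" as GB, neither of which the text above states explicitly. The `Node` class, the dictionary keys, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Minimums copied from the list above. Treating them as pool-wide totals
# is an assumption of this sketch.
MINIMUMS = {
    "AI&MLOps Platform": {"vcpu": 32, "memory_gb": 128},
    "Kubeflow Mini": {"vcpu": 24, "memory_gb": 96},
}


@dataclass
class Node:
    vcpu: int
    memory_gb: int


def meets_minimum(product: str, nodes: list[Node]) -> bool:
    """Check node count (2 or more) and aggregate vCPU/memory for a product."""
    required = MINIMUMS[product]
    if len(nodes) < 2:
        return False
    total_vcpu = sum(node.vcpu for node in nodes)
    total_memory = sum(node.memory_gb for node in nodes)
    return total_vcpu >= required["vcpu"] and total_memory >= required["memory_gb"]


if __name__ == "__main__":
    pool = [Node(vcpu=16, memory_gb=64), Node(vcpu=16, memory_gb=64)]
    print(meets_minimum("AI&MLOps Platform", pool))
```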

2.2.2 - Kubeflow Usage Guide

This guide describes how to use Kubeflow after it has been created.

Add Kubeflow User

At installation, Kubeflow creates only one account: the Admin User entered on the initial installation screen.

When using the Kubeflow Dashboard, to add users other than the initial user, you must modify the settings of Dex (the authentication integration component of Kubeflow).

  • Dex is deployed in the auth namespace, and its configuration is stored in a configmap named dex.
Reference
Kubeflow separates namespaces for each user.

The following is an example of Dex configuration.

apiVersion: v1
kind: ConfigMap
metadata:
  name: dex
  namespace: auth
data:
  config.yaml: |
    issuer: http://dex.auth.svc.cluster.local:5556/dex
    storage:
      type: kubernetes
      config:
        inCluster: true
    web:
      http: 0.0.0.0:5556
    logger:
      level: "debug"
      format: text
    oauth2:
      skipApprovalScreen: true
    enablePasswordDB: true
    staticPasswords:
    - email: admin@kubeflow.org
      hash: $2y$10$Yb9WVbn8pzVSM6fBgKdFae1Bh6Z.XTihi7bNu3sB6/h5bt1JuUOgq
      username: admin
      userID: 9cb67307-fd6d-4441-9b59-52acd78f4c9e
    staticClients:
    - id: kubeflow-oidc-authservice
      redirectURIs: ["/login/oidc"]
      name: 'Dex Login Application'
      secret: pUBnBOY80SnXgjibTYM9ZWNzY2xreNGQok    
Code block. Dex environment configuration example

When enablePasswordDB is true in the configuration, Dex loads the users defined under staticPasswords in the configmap into its internal storage when the service starts. To add users beyond the initial one, add entries consisting of email, hash, username, and userID to staticPasswords.

The attribute values for adding a user can be defined as follows.

Parameter | Explanation
email | A value in standard email format
hash | The user's password hashed with the Bcrypt algorithm; enter the generated hash value directly
username | User name
  • Follows the Kubernetes namespace naming conventions
  • Up to 63 characters; only lowercase letters, numbers, and - are allowed
userID | A uniquely identifiable ID value
  • The initial user's userID is generated using the uuidgen command
Table. Attribute values for adding a user
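
The attribute rules above can be combined into a small helper that prepares a new staticPasswords entry. This is a sketch, not an official tool: the function names are hypothetical, the Bcrypt hash is assumed to have been generated separately, and the userID comes from Python's uuid module rather than the uuidgen command.

```python
import re
import uuid

# Kubernetes namespace names are RFC 1123 labels: lowercase alphanumerics or
# '-', starting and ending with an alphanumeric, at most 63 characters.
USERNAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")
# Bcrypt hashes begin with a version marker and a two-digit cost, e.g. "$2y$10$".
BCRYPT_PREFIX_RE = re.compile(r"\$2[aby]\$\d{2}\$")


def make_static_password(email: str, bcrypt_hash: str, username: str) -> dict:
    """Validate the inputs and return one staticPasswords entry (hypothetical helper)."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"username is not a valid namespace name: {username!r}")
    if not BCRYPT_PREFIX_RE.match(bcrypt_hash):
        raise ValueError("hash does not look like a Bcrypt hash")
    return {
        "email": email,
        "hash": bcrypt_hash,
        "username": username,
        "userID": str(uuid.uuid4()),  # random UUID, similar in form to uuidgen output
    }


def render_entry(entry: dict, indent: int = 4) -> str:
    """Render the entry as YAML list-item text for pasting under staticPasswords."""
    pad = " " * indent
    lines = [f"{pad}- email: {entry['email']}"]
    for key in ("hash", "username", "userID"):
        lines.append(f"{pad}  {key}: {entry[key]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The hash is the sample value from the Dex configuration example.
    sample_hash = "$2y$10$Yb9WVbn8pzVSM6fBgKdFae1Bh6Z.XTihi7bNu3sB6/h5bt1JuUOgq"
    print(render_entry(make_static_password("user@example.com", sample_hash, "sds")))
```

The helper checks only the form of the hash prefix; it does not verify that the hash corresponds to any particular password.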

From a node where kubectl is available, run the following command to open the dex configmap for editing.

kubectl edit configmap dex -n auth
Code block. kubectl - modify dex configmap

Add the new user's entry under staticPasswords, as in the following example.

staticPasswords:
    - email: admin@kubeflow.org
      hash: $2y$10$Yb9WVbn8pzVSM6fBgKdFae1Bh6Z.XTihi7bNu3sB6/h5bt1JuUOgq
      username: admin
      userID: 9cb67307-fd6d-4441-9b59-52acd78f4c9e
    - email: sds@samsung.com
      hash: $2y$12$0g5.y86jnrt0v6In5NRCZ.YVuvrAUQ6j/RJYO3rV.kNulaDALOKfq
      username: sds
      userID: 8961d517-3498-4148-90c9-7e442ee91154
Code block. Modify dex configmap

Because Dex reads the staticPasswords value from the configmap only when the service starts, restart the Dex service using the following command.

kubectl rollout restart deployment dex -n auth
Code block. kubectl - dex restart

Log in with the new user's credentials.

Figure 1
New user information login

After a successful login, verify that the screen for creating a new Namespace (profile) appears.

Figure 2
Create Namespace Name

The above content was written with reference to the official Kubeflow site. For more details, see Kubeflow Profiles.

How to use Custom Image in Kubeflow Jupyter Notebook

To use a custom image with the Kubeflow Notebook Controller, which manages the life cycle of Kubeflow Notebooks, the image must meet several requirements.

Kubeflow assumes that Jupyter will start automatically when a Notebook image is run. Therefore, you need to set the default command to start Jupyter in the container image.

The following is an example of what should be included in a Dockerfile.

# Default notebook URL prefix; Kubeflow overrides NB_PREFIX at runtime
ENV NB_PREFIX=/

# Start Jupyter automatically when the container runs
CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
Code block. Dockerfile example

The above items are explained as follows.

Parameter | Explanation
--notebook-dir=/home/jovyan | Set the working directory
  • The /home/jovyan directory is mounted to a Kubernetes persistent volume (PV)
--ip=0.0.0.0 | Allow Jupyter Notebook to accept connections from any IP
--allow-root | Allow Jupyter Notebook to be run as the root user
--port=8888 | Port configuration
--NotebookApp.token='' --NotebookApp.password='' | Disable Jupyter authentication
  • Since Kubeflow relies on Istio for authentication, the authentication feature provided by Jupyter is disabled
  • With this configuration, you can access the Jupyter Notebook Server without a password
--NotebookApp.allow_origin='*' | Allow requests from any origin
--NotebookApp.base_url=${NB_PREFIX} | Base URL setting
Table. Settings to include in Dockerfile
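
To see what the CMD line resolves to once the environment variables are set, the substitution can be reproduced outside the container. This is an illustrative sketch only: the NB_USER and NB_PREFIX values are hypothetical examples, and `string.Template` stands in for the shell's variable expansion.

```python
from string import Template

# The same command string used in the Dockerfile CMD above.
CMD_TEMPLATE = Template(
    "jupyter notebook --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser "
    "--allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' "
    "--NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"
)


def expand_cmd(env: dict) -> str:
    """Substitute NB_USER and NB_PREFIX the way `sh -c` would expand them."""
    return CMD_TEMPLATE.substitute(env)


if __name__ == "__main__":
    # Hypothetical values chosen for illustration.
    print(expand_cmd({"NB_USER": "jovyan", "NB_PREFIX": "/notebook/demo/nb1"}))
```

If either variable is missing from the dictionary, `substitute` raises `KeyError`, which makes an unset variable easy to spot.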

You can create a Custom Image by referring to the Dockerfile that builds the TensorFlow notebook image.

Reference
The Custom Image must be stored in a public registry such as Docker Hub or in a private registry, and Kubeflow must be able to pull it from that registry.
  1. On the Notebook Servers page, click the +NEW SERVER button.

    Figure 3

  2. If you have created a Custom Image, check Custom Image on the Kubeflow Notebook Server screen and enter the Custom Image address to create a new Notebook Server.

    Figure 4

Information

The above content was written with reference to the Kubeflow official site.

2.3 - API Reference

API Reference

2.4 - CLI Reference

CLI Reference

2.5 - Release Note

AI&MLOps Platform

2025.07.01
FEATURE AI&MLOps Platform open-source version upgrade
  • The AI&MLOps Platform open-source version has been upgraded.
    • Kubeflow 1.9
2025.02.27
NEW AI&MLOps Platform service official version release
  • The AI&MLOps Platform service, which automates repetitive tasks across the entire pipeline of machine learning model development, training, and deployment, has been launched.
  • We provide a machine learning platform service based on Kubernetes.