The page has been translated by Gen AI.

Overview

Service Overview

AI&MLOps Platform is a machine learning platform that automates repetitive tasks across the entire pipeline of developing, training, and deploying machine learning models. With the AI&MLOps Platform service, you can manage training data, models, and operational data in an integrated way on a Kubernetes-based AI/MLOps environment.

The AI&MLOps Platform provides an Enterprise service that extends the open-source product Kubeflow.Mini, which supports development, training, tuning, and deployment of machine learning models, with add-on features such as distributed training job execution and monitoring.

Reference
For sites related to the AI&MLOps Platform, see Kubeflow.

Features

  • Providing a Cloud Native MLOps Environment: The AI&MLOps Platform provides a cloud‑optimized machine learning model development environment, and its Kubernetes‑based architecture makes integration with various open‑source tools convenient.

  • Machine Learning Development and Operations Convenience: Provides a standardized environment that supports machine learning frameworks such as TensorFlow, PyTorch, scikit-learn, and Keras. By automating the entire pipeline for developing, training, and deploying machine learning models, it makes models easy to compose and create and promotes reusability.

  • Enhanced GPU Integration: Multi-Node GPU on Bare Metal Servers, combined with GPUDirect RDMA (Remote Direct Memory Access), can significantly speed up LLM (Large Language Model) and natural language processing (NLP) jobs.

Service Diagram

Figure. AI&MLOps Platform Diagram

Provided features

The AI&MLOps Platform provides the following features.

  • ML Model Development Environment and Features

    • Notebook Provision: Creates Jupyter Notebook and VS Code environments that include ML frameworks such as TensorFlow and PyTorch.
    • TensorBoard: Creates and manages TensorBoard servers (TensorBoard is a tool for visualizing and analyzing the ML model training process).
    • Volumes: Stores the datasets and models used during ML model development. A Volume can be connected when creating a Jupyter Notebook.
  • ML model distributed training Job execution/management

    • Supports execution and monitoring of distributed training jobs, as well as management and analysis of inference services. (Add-on)
    • Provides various features for configuring MLOps environments, such as Job Queue management. (Add-on)
    • Provides features for efficient GPU resource utilization, such as Job Scheduler (FIFO, Bin-packing, Gang-based), GPU Fraction, and GPU resource monitoring. (Add-on)
    • Significantly speeds up LLM (Large Language Model) and natural language processing (NLP) jobs by using Bare Metal (BM)-based Multi-Node GPU and GPUDirect RDMA (Remote Direct Memory Access). (Add-on)
  • ML Model Experiment Management and Pipeline

    • Provides Experiments (KFP) for managing ML pipeline experiments.
    • Supports pipeline automation features for configuring and executing ML tasks in stages.
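
The three Job Scheduler policies named in the list above (FIFO, Bin-packing, Gang-based) can be illustrated with a small sketch. This is not the platform's scheduler: the `Node` and `Job` classes, the function names, and the GPU counts are all hypothetical, chosen only to show what each policy means. FIFO tries jobs in arrival order, bin-packing places each worker on the fullest node that still fits, and gang scheduling places all workers of a job or none of them.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int


@dataclass
class Job:
    name: str
    workers: int          # number of workers in the distributed job
    gpus_per_worker: int  # GPUs each worker needs


def gang_place(nodes, job):
    """Gang-based + bin-packing: place every worker of the job, or none.

    Returns the list of node names (one per worker), or None if the
    whole job does not fit. Nodes are only updated on success.
    """
    free = {n.name: n.free_gpus for n in nodes}  # trial copy, not committed yet
    placement = []
    for _ in range(job.workers):
        fitting = [name for name, gpus in free.items() if gpus >= job.gpus_per_worker]
        if not fitting:
            return None  # one worker does not fit, so the whole gang is rejected
        # Bin-packing (best fit): prefer the node with the fewest free GPUs.
        best = min(fitting, key=lambda name: free[name])
        free[best] -= job.gpus_per_worker
        placement.append(best)
    for n in nodes:  # commit only after every worker has a slot
        n.free_gpus = free[n.name]
    return placement


def schedule_fifo(nodes, queue):
    """FIFO: try jobs strictly in arrival order; record a placement or None."""
    return {job.name: gang_place(nodes, job) for job in queue}


if __name__ == "__main__":
    cluster = [Node("a", 8), Node("b", 4)]
    jobs = [Job("small", 1, 2), Job("big", 2, 8), Job("mid", 2, 4)]
    print(schedule_fifo(cluster, jobs))
    print(cluster)
```

In this sketch a job that cannot be fully placed leaves the cluster untouched, which is the point of gang scheduling for distributed training: a partially started job would hold GPUs without making progress.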

Components

Operating System version

The operating systems supported by the AI&MLOps Platform are as follows.

Operating System (OS) | Version
RHEL | RHEL 8.3
Ubuntu | Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04
Table. Supported Operating System Versions

Provision status by region

The AI&MLOps Platform is available in the environments below.

Region | Provision status
Korea West (kr-west1) | Provided
Korea East (kr-east1) | Provided
Korea South 1 (kr-south1) | Not provided
Korea South 2 (kr-south2) | Not provided
Korea South 3 (kr-south3) | Not provided
Table. AI&MLOps Platform regional availability status

Prerequisite Services

The following services must be configured before you create this service. For details, refer to the guide provided for each service and prepare in advance.

Service Category | Service | Detailed description
Container | Kubernetes Engine | Kubernetes container orchestration service
Table. AI&MLOps Platform Prerequisite Services