
Overview

Service Overview

AI&MLOps Platform is a machine learning platform that automates the repetitive tasks across the entire pipeline of machine learning model development, training, and deployment. Through the AI&MLOps Platform service, training data, models, and operational data can be managed in an integrated way on a Kubernetes-based AI/MLOps environment.

AI&MLOps Platform is an open-source-based product offered in two tiers: the Mini service, which provides the development, training, tuning, and deployment functions of machine learning models based on Kubeflow, and the Enterprise service, which adds Add-on functions such as distributed training Job execution and monitoring.

Reference
For sites related to AI&MLOps Platform, refer to Kubeflow.

Features

  • Cloud Native MLOps Environment: AI&MLOps Platform provides a machine learning model development environment optimized for the cloud and integrates easily with various open-source tools based on Kubernetes.

  • Machine Learning Development and Operational Convenience: Provides a standardized environment that supports various machine learning frameworks such as TensorFlow, PyTorch, scikit-learn, and Keras. By automating the entire pipeline of machine learning model development, training, and deployment, models are easy to configure, create, and reuse.

  • GPU Performance Enhancement: With Bare Metal Server-based multi-node GPUs and GPUDirect RDMA (Remote Direct Memory Access), the job speed of LLM (Large Language Model) and NLP (Natural Language Processing) workloads can be dramatically improved.

Service Composition Diagram

Fig. AI&MLOps Platform Configuration Diagram

Provided Features

The AI&MLOps Platform provides the following functions.

  • ML Model Development Environment and Features

  • Notebook provision: Creates Jupyter Notebook and VS Code environments with ML frameworks (TensorFlow, PyTorch, etc.).

  • TensorBoard: Creates and manages TensorBoard servers (a tool for visualizing and analyzing the ML model training process).

  • Volumes: When developing an ML model, use a volume to store datasets and models, and attach the volume when creating a Jupyter Notebook.
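The volume workflow above can be sketched in a notebook cell: artifacts written to the mounted volume path survive notebook restarts and can be reloaded later. This is a minimal illustration, not the platform's API; the actual mount path depends on the volume you attach (a temporary directory stands in for it here).

```python
import pickle
import tempfile
from pathlib import Path

# In a Kubeflow notebook the attached volume appears at a fixed mount path;
# here we simulate that mount with a temporary directory so the sketch runs anywhere.
VOLUME_DIR = Path(tempfile.mkdtemp())

def save_artifact(obj, name: str) -> Path:
    """Persist a Python object (dataset, model, etc.) to the attached volume
    so it survives notebook restarts."""
    path = VOLUME_DIR / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path

def load_artifact(name: str):
    """Reload a previously saved object from the attached volume."""
    with (VOLUME_DIR / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)

# Example: store a small dataset, then reload it in the same or another session.
dataset = {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [0, 1]}
save_artifact(dataset, "train-set")
restored = load_artifact("train-set")
assert restored == dataset
```

Because the same volume can be attached to a new notebook, the saved artifacts are shared across notebook lifecycles rather than being lost with the container.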

  • Distributed Training Job for ML Model Execution/Management

  • Supports distributed training Job execution and monitoring, as well as inference service management and analysis. (Add-on)

  • Provides various functions for managing Job Queues and configuring the MLOps environment. (Add-on)

  • Provides efficient GPU resource utilization features such as Job Schedulers (FIFO, bin-packing, gang-based), GPU Fraction, and GPU resource monitoring. (Add-on)

  • BM-based multi-node GPUs and GPUDirect RDMA (Remote Direct Memory Access) significantly improve the job speed of LLM (Large Language Model) and NLP (Natural Language Processing) workloads. (Add-on)
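To make the scheduling terms concrete, the following is an illustrative sketch of FIFO ordering combined with bin-packing placement: jobs are taken in arrival order, and each is packed onto the node with the least remaining free GPUs that still fits it, keeping large free blocks available for big jobs. This is not the platform's actual scheduler; node names, job sizes, and the pending behavior are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_gpus: int
    used_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def schedule_fifo_binpack(jobs, nodes):
    """Assign each (job_name, gpus_needed) pair, in FIFO order, to the node
    with the fewest free GPUs that can still hold it (bin-packing).
    Returns {job_name: node_name}, with None for jobs left pending."""
    queue = deque(jobs)
    placement = {}
    while queue:
        job, need = queue.popleft()
        candidates = [n for n in nodes if n.free_gpus >= need]
        if not candidates:
            # Job stays pending; a gang scheduler would also hold its peer
            # workers back so a distributed job starts all-or-nothing.
            placement[job] = None
            continue
        best = min(candidates, key=lambda n: n.free_gpus)
        best.used_gpus += need
        placement[job] = best.name
    return placement

nodes = [Node("node-a", 8), Node("node-b", 4)]
jobs = [("train-1", 4), ("train-2", 4), ("train-3", 8)]
print(schedule_fifo_binpack(jobs, nodes))
# → {'train-1': 'node-b', 'train-2': 'node-a', 'train-3': None}
```

Note how bin-packing places the first 4-GPU job on the smaller node, preserving the 8-GPU node for a larger job; a naive spread policy would have fragmented both nodes.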

  • ML Model Experiment Management and Pipeline

  • Provides ML pipeline experiment management through Experiments (KFP). Supports pipeline automation so that ML tasks are executed step by step.
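The step-by-step pipeline execution described above can be sketched as a chain of named steps, each consuming the previous step's output. This toy runner is a stand-in for illustration only, not the KFP SDK; actual pipelines are defined with the Kubeflow Pipelines DSL and submitted to Experiments, and the step functions below are hypothetical.

```python
def run_pipeline(steps):
    """Execute named pipeline steps in order, feeding each step's output
    into the next, and record every intermediate artifact."""
    artifact = None
    history = []
    for name, fn in steps:
        artifact = fn(artifact)
        history.append((name, artifact))
    return history

def load_data(_):
    # Step 1: produce a raw dataset (hypothetical fixed sample).
    return [1.0, 2.0, 3.0, 4.0]

def preprocess(data):
    # Step 2: center the data around its mean.
    mean = sum(data) / len(data)
    return [x - mean for x in data]

def train(data):
    # Step 3: "train" a toy model — here just the variance of the centered data.
    return sum(x * x for x in data) / len(data)

history = run_pipeline([("load", load_data),
                        ("preprocess", preprocess),
                        ("train", train)])
print(history[-1])  # → ('train', 1.25)
```

Because every intermediate artifact is recorded, individual steps can be inspected or reused, which mirrors why pipeline-based execution makes ML experiments reproducible.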

Component

Operating System Version

The operating systems supported by the AI&MLOps Platform are as follows.

Operating System (OS) | Version
RHEL | RHEL 8.3
Ubuntu | Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04
Table. Supported operating system versions

Regional Provision Status

The AI&MLOps Platform can be provided in the following environments.

Region | Availability
Korea West 1 (kr-west1) | Provided
Korea East 1 (kr-east1) | Provided
Korea South 1 (kr-south1) | Not provided
Korea South 2 (kr-south2) | Not provided
Korea South 3 (kr-south3, Busan) | Not provided
Table. AI&MLOps Platform Regional Provision Status

Preceding service

The following services must be configured before creating this service. For details, refer to the guide provided for each service and prepare them in advance.

Service Category | Service | Detailed Description
Container | Kubernetes Engine | Kubernetes container orchestration service
Table. AI&MLOps Platform Preceding Services