Overview
Service Overview
AI&MLOps Platform is a machine learning platform that automates repetitive tasks across the entire pipeline of developing, training, and deploying machine learning models. The service lets you manage training data, models, and operational data together in a Kubernetes-based AI/MLOps environment.
The AI&MLOps Platform provides an Enterprise service that extends Kubeflow.Mini, an open-source product for developing, training, tuning, and deploying machine learning models, with add-on features such as distributed training job execution and monitoring.
Features
Providing a Cloud Native MLOps Environment: The AI&MLOps Platform provides a cloud‑optimized machine learning model development environment, and its Kubernetes‑based architecture makes integration with various open‑source tools convenient.
Machine Learning Development and Operations Convenience: The AI&MLOps Platform provides a standardized environment that supports machine learning frameworks such as TensorFlow, PyTorch, scikit-learn, and Keras. Automating the entire pipeline for developing, training, and deploying machine learning models makes models easier to compose and create, and promotes reuse.
Enhanced GPU Integration: Multi‑Node GPU on a Bare Metal Server combined with GPUDirect RDMA (Remote Direct Memory Access) can substantially speed up LLM (Large Language Model) and natural language processing (NLP) jobs.
Service Diagram
Provided features
The AI&MLOps Platform provides the following features.
ML Model Development Environment and Features
- Notebook Provision: Creates Jupyter Notebook and VS Code environments that include ML frameworks such as TensorFlow and PyTorch.
- TensorBoard: Creates and manages TensorBoard servers. TensorBoard is a tool for visualizing and analyzing the ML model training process.
- Volumes: Stores the datasets and models used during ML model development. A Volume can be connected when creating a Jupyter Notebook.
ML Model Distributed Training Job Execution and Management
- Supports execution and monitoring of distributed training jobs, as well as management and analysis of inference services. (Add-on)
- Provides various features for configuring MLOps environments, such as Job Queue management. (Add-on)
- Provides features for efficient GPU resource utilization, such as Job Scheduler (FIFO, Bin-packing, Gang-based), GPU Fraction, and GPU resource monitoring. (Add-on)
- Substantially speeds up LLM (Large Language Model) and natural language processing (NLP) jobs by using Bare Metal (BM)-based Multi-Node GPU and GPUDirect RDMA (Remote Direct Memory Access). (Add-on)
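The scheduling policies named above (FIFO, Bin-packing, Gang-based) can be illustrated with a minimal sketch. This is not the platform's scheduler, whose implementation is not described here. It only shows the general idea of each policy, and the names `Job`, `place_bin_packing`, `schedule_gang`, and `run_fifo` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int          # number of workers the job needs
    gpus_per_worker: int  # GPUs requested by each worker


def place_bin_packing(free, gpus):
    """Return the index of the fullest node that still fits `gpus`, or None."""
    candidates = [i for i, f in enumerate(free) if f >= gpus]
    if not candidates:
        return None
    # Best fit: prefer the node with the fewest free GPUs that still fits.
    return min(candidates, key=lambda i: free[i])


def schedule_gang(job, free):
    """Place all workers of `job` or none of them (gang semantics).

    `free` is a list of free GPU counts per node. It is updated only
    when every worker fits; otherwise it is left untouched.
    """
    trial = list(free)
    placement = []
    for _ in range(job.workers):
        node = place_bin_packing(trial, job.gpus_per_worker)
        if node is None:
            return None  # discard the partial placement
        trial[node] -= job.gpus_per_worker
        placement.append(node)
    free[:] = trial
    return placement


def run_fifo(queue, free):
    """Admit jobs in arrival order; stop at the first job that does not fit."""
    started = {}
    for job in queue:
        placement = schedule_gang(job, free)
        if placement is None:
            break  # FIFO: later jobs wait behind the blocked one
        started[job.name] = placement
    return started
```

For example, with two nodes holding 8 and 4 free GPUs, `run_fifo([Job("a", 2, 2), Job("b", 1, 8), Job("c", 1, 1)], [8, 4])` packs both workers of job `a` onto the smaller node, which leaves the 8-GPU node whole for job `b`. Job `c` then waits because no GPUs remain.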
ML Model Experiment Management and Pipeline
- Provides Experiments (KFP) for managing ML pipeline experiments.
- Supports pipeline automation features for configuring and executing ML tasks in stages.
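Executing ML tasks in stages amounts to running steps in dependency order and passing each step the results of the steps it depends on. The sketch below illustrates that idea with the Python standard library only. It does not use the platform's pipeline SDK, and the step names and `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps):
    """Run `steps` (name -> callable) in an order that respects `deps`.

    `deps` maps each step name to the names of the steps it depends on.
    Each callable receives a dict holding its upstream results.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = steps[name](upstream)
    return results


# Hypothetical stages; real stages would load data, train, and evaluate a model.
steps = {
    "prepare": lambda up: [1.0, 2.0, 3.0, 4.0],
    "train": lambda up: sum(up["prepare"]) / len(up["prepare"]),
    "evaluate": lambda up: max(abs(x - up["train"]) for x in up["prepare"]),
}
deps = {"prepare": [], "train": ["prepare"], "evaluate": ["prepare", "train"]}

results = run_pipeline(steps, deps)
```

Declaring dependencies separately from the step functions keeps each step reusable in other pipelines, which is the reuse benefit that staged pipelines aim for.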
Component
Operating System version
The operating systems supported by the AI&MLOps Platform are as follows.
| Operating System (OS) | Version |
|---|---|
| RHEL | RHEL 8.3 |
| Ubuntu | Ubuntu 18.04, Ubuntu 20.04, Ubuntu 22.04 |
Provision status by region
The AI&MLOps Platform is available in the regions below.
| Region | Provision status |
|---|---|
| Korea West (kr-west1) | Provided |
| Korea East (kr-east1) | Provided |
| Korea South 1 (kr-south1) | Not provided |
| Korea South 2 (kr-south2) | Not provided |
| Korea South 3 (kr-south3) | Not provided |
Prerequisite Services
This is a list of services that must be pre-configured before creating the service. For details, refer to the guide provided for each service and prepare in advance.
| Service Category | Service | Detailed description |
|---|---|---|
| Container | Kubernetes Engine | Kubernetes container orchestration service |
