The page has been translated by Gen AI.

Building MLOps with AI/ML

Overview

Machine Learning (hereafter ML) is expanding beyond analysis and model development into operations. Work that involves building ML models and applying them to services has increased recently, but in practice more time goes into data collection, analysis, and model tuning than into model development itself. Production-level ML systems also need functions that automate and manage complex ML workflows across the MLOps lifecycle: model development and training, model tuning, model build and deployment, and model management.

Kubeflow is a Kubernetes-based open-source machine learning platform that supports such ML workflows. It was released as an open-source project in 2018 with the participation of Google, Cisco, IBM, Red Hat, and others, and version 1.0 was released in March 2020.

Kubeflow provides scalable ML workflow capabilities by combining Kubernetes-based open-source projects from each domain (Istio, Knative, Argo, etc.).

In Samsung Cloud Platform, open-source Kubeflow itself is offered as the Kubeflow Mini service, and the AI&MLOps Platform service extends it with various add-on features.

With AI&MLOps Platform, you can use various additional features that are not provided by the open source, such as distributed training job execution and monitoring, inference service management and analysis, job queue management, job scheduler, GPU fraction, and GPU resource monitoring.

This document explains how to set up and use the AI&MLOps Platform, which supports MLOps, on the Samsung Cloud Platform.

Architecture Diagram

Installing the AI&MLOps Platform service requires a Kubernetes cluster, so the user first creates a Kubernetes Engine service. At this point, the Persistent Volume (PV) of the Kubernetes cluster is created in the File Storage service.
The user selects a Kubernetes cluster in the VPC to deploy the AI&MLOps Platform. After the AI&MLOps Platform installation is complete, the user can access the AI&MLOps Platform Dashboard URL and use the ML Workflow features provided by the AI&MLOps Platform.
1. You can use Object Storage to store analysis datasets and model files, and integrate it from Jupyter Notebook using the S3 SDK.
2. You can set up and use Container Registry, or use an external DockerHub.
3. You can use Cloud Hadoop for data preprocessing and for integration with large-scale distributed data processing.
※ When using external services outside the user’s VPC, such as DockerHub or GitHub, you need to add rules to the Security Group and Firewall.
To access the AI&MLOps Platform Dashboard through a domain, create the domain in the DNS service and connect it to the Load Balancer and Kubernetes Worker to enable domain-based access.
Integrating services such as Object Storage, Container Registry, and Cloud Hadoop requires creating each service separately from the AI&MLOps Platform service. Refer to each service’s usage guide to use it in conjunction with the AI&MLOps Platform service.
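
As a minimal sketch of the Object Storage integration from a Jupyter Notebook, the following Python snippet uses boto3, a commonly used S3 SDK. It assumes the Object Storage endpoint is S3-compatible. The endpoint URL, credentials, bucket name, and the `models/<project>/<version>/` key layout are hypothetical placeholders, not values defined by this guide.

```python
from pathlib import PurePosixPath


def make_s3_client(endpoint_url, access_key, secret_key):
    """Create an S3 client for an S3-compatible endpoint (requires boto3)."""
    import boto3  # imported lazily so the helpers below work without it

    return boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )


def model_key(project, version, filename):
    """Build an object key of the form models/<project>/<version>/<filename>."""
    return str(PurePosixPath("models") / project / version / filename)


def upload_model(client, bucket, local_path, project, version):
    """Upload a local model file and return the object key it was stored under."""
    key = model_key(project, version, PurePosixPath(local_path).name)
    client.upload_file(local_path, bucket, key)  # boto3: upload_file(Filename, Bucket, Key)
    return key


# Hypothetical usage from a notebook cell (all values are placeholders):
# client = make_s3_client("https://<object-storage-endpoint>", "<access-key>", "<secret-key>")
# upload_model(client, "<bucket>", "model.pkl", "defect-detector", "v1")
```

Keeping the key layout in one helper makes it easier to find a given model version later, for example when an inference service needs to load it.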

Use Cases

Building a production process defect detection system based on AI&MLOps Platform

AI&MLOps Platform provides excellent reusability, scalability, and reliability on a Kubernetes-based foundation.

AI&MLOps Platform provides a Jupyter Notebook-based model development environment. It also provides hyperparameter tuning and result validation features, enabling performance comparison across multiple models.

It also supports GPU-based distributed training with ML frameworks (TensorFlow, PyTorch, etc.), which can reduce model training time. For model deployment, it provides an Endpoint API that can be called from applications and supports scaling out the service.
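
As a rough sketch of how an application might call such an Endpoint API, the Python snippet below builds and sends a prediction request using only the standard library. It assumes the endpoint follows the KServe V1 inference protocol (a POST to `/v1/models/<name>:predict` with an `instances` JSON body), which is common in Kubeflow-based deployments but is not confirmed by this guide. The host, model name, and input values are hypothetical.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances):
    """Build a POST request in the (assumed) KServe V1 predict format."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, model_name, instances, timeout=10):
    """Send the request and return the 'predictions' field of the response."""
    req = build_predict_request(base_url, model_name, instances)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Hypothetical usage (host and model name are placeholders):
# predict("http://<endpoint-host>", "defect-detector", [[0.12, 0.80, 0.33]])
```

Depending on how the endpoint is exposed, the request may also need authentication headers or a specific host header; check the deployed service's details before relying on this shape.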

When using GPU-based high-performance infrastructure (GPU Direct RDMA), you can improve distributed training performance by an average of 1.5 times.

Building an MLOps-based model development and operational deployment system

AI&MLOps Platform enables the creation of reusable workflows through pipelines, so you can configure an MLOps environment that spans model development to deployment. You can also automate retraining the model with additional data using the same pipeline.
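
The idea of reusing one pipeline for both initial training and retraining can be illustrated with a small plain-Python sketch. This is not the Kubeflow Pipelines SDK; on the platform, each function here would correspond to a pipeline step. The step names, the toy threshold "model", and the sample values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def preprocess(samples):
    """Drop missing measurements."""
    return [x for x in samples if x is not None]


def train(samples, k=3.0):
    """Fit a toy defect threshold: mean + k * population standard deviation."""
    return {"threshold": mean(samples) + k * pstdev(samples)}


def evaluate(model, samples):
    """Return the fraction of samples above the threshold (flagged as defects)."""
    flagged = [x for x in samples if x > model["threshold"]]
    return len(flagged) / len(samples)


def run_pipeline(raw_samples, k=3.0):
    """Run the steps in a fixed order; used for both training and retraining."""
    data = preprocess(raw_samples)
    model = train(data, k=k)
    return model, evaluate(model, data)


# Initial training run.
initial = [1.0, 1.2, None, 0.9, 1.1]
model_v1, _ = run_pipeline(initial)

# Retraining: the same pipeline, run on the original data plus new samples.
model_v2, _ = run_pipeline(initial + [1.3, 1.4, 0.8])
```

Because the pipeline definition does not change between runs, retraining only requires supplying the additional data, which is what makes it straightforward to automate.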

Prerequisites

To install the AI&MLOps Platform, a Kubernetes Cluster that meets the minimum specifications is required. Also, when creating a Kubernetes cluster, you must select “use” for the Load Balancer option.

Constraints

Models in AI&MLOps Platform can be deployed only to the Kubernetes cluster on which the platform is currently configured. The model development environment and the operational environment for inference services are provided on a single cluster without separation.

Considerations

Consider using Object Storage to store data and models. For container images, consider using a Container Registry or an external DockerHub. Additionally, you can extend the functionality of the AI&MLOps Platform by deploying open-source pipeline tools such as Elyra.

Related Services

This is a list of Samsung Cloud Platform services that are related to the features or configurations described in this guide. Refer to it when selecting and designing services.

Service group	Service	Detailed description
Container	Kubernetes Engine	Kubernetes container orchestration service
Storage	File Storage	Storage that enables multiple client servers to share files over a network connection
Storage	Object Storage	Object storage that simplifies data storage and retrieval
Networking	Load Balancer	A service that automatically distributes server traffic load
Networking	DNS	A service for easily configuring and managing domains
Container	Container Registry	A service that easily stores, manages, and shares container images
Cloud Hadoop	Cloud Hadoop	A service that provides Hadoop clusters for easy and fast big data processing and analysis

Table. List of related services