[paper review] MLOps: Overview, Definition, and Architecture

99_shimshim 2025. 5. 25. 22:02

review date : 2025.05.24

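I was assigned the week 2 presentation in my department's study group.

The preparation time was short, only one week, but I had already reviewed this paper once and had since taken part in an implementation project at work. I was curious how a second review would go now that I had some background knowledge, so I accepted the offer to give the first presentation.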

Abstract
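1. The final goal of an ML project: developing a model and bringing it to production 'quickly'.
2. However, process automation and operationalization mostly fall short of expectations.
3. MLOps was proposed to address this, but it remains a vague term.
4. Through a literature review, a tool review, and expert interviews, this study provides an integrated overview of the necessary principles, components and roles, architecture, and workflows of MLOps.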
Introduction
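1. Many ML projects focus excessively on building a good model and neglect both bringing the model into production and providing the real-world infrastructure and automation systems that go with it.
  1. Data scientists still manage many ML workflows manually.
  2. This causes many operational problems in the ML solution.
2. To rescue the many projects that die at the PoC stage, MLOps was proposed to help design and maintain 'productionizable' ML.
3. To answer the question “What is MLOps?”, the authors conduct mixed-method research to identify:
  1. The key principles of MLOps
  2. The main functional components of MLOps
  3. The roles required for a successful MLOps implementation
  4. A general architecture for ML system design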
Foundations of DevOps
1. Methodologies for shipping production-ready software products: Waterfall vs. Agile
2. The DevOps concept emerged in 2008/2009.
  1. Its goal is to close the gap between development and operations
  2. It emphasizes collaboration and sharing
  3. It promotes CI/CD automation (continuous integration / continuous delivery and deployment)
  4. It additionally ensures continuous testing, QA, monitoring, logging, and feedback loops
3. tools (6 groups)
  1. collaboration and knowledge sharing
    1. Slack
  2. code management
    1. GitHub, GitLab
  3. build process
    1. Maven
  4. continuous integration
    1. Jenkins
  5. deployment automation
    1. Kubernetes, Docker
  6. monitoring and logging
    1. Prometheus
4. DevOps is being extended to ML automation and operations.
Methodology
1. Literature Review
  1. Because the study was conducted in the early days of the MLOps field, only 27 peer-reviewed articles were selected for the literature review.
2. Tool Review
3. Interview Study
  1. Interviews with 8 experts
Results : Principles / Technical Components / Roles
1. Principles: a principle is a “guide” to how things should be realized (== best practice)
  1. P1: CI/CD automation
    1. Carries over the DevOps idea directly: performs the build, test, delivery, and deploy steps.
    Fast feedback on the success or failure of each step improves overall productivity.
  2. P2: Workflow Orchestration
    1. Connects the tasks of an ML workflow pipeline according to DAGs (Directed Acyclic Graphs)
    2. DAG ≈ blueprint of the pipeline
  3. P3: Reproducibility
    1. The ability to reload a previous experiment and obtain the same result again
  4. P4: Versioning
    1. Versioning in support of P3
    2. Versioning for audits and internal controls
  5. P5: Collaboration
    1. Enables collaborative work on data, models, and code
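  6. P6: Continuous ML training & evaluation (CT)
    1. Retraining on new feature data
    2. The update frequency (daily/weekly…) must be chosen carefully.
    3. Online learning can reduce the cost of retraining.
      1. Batch (offline) learning: train on the full dataset during development, with no further training in operation
      2. Online learning: keep training on streaming data in mini-batches during operation
  7. P7: ML metadata tracking/logging
    1. Per training job iteration: training date and time, training duration…
    2. Model-specific metadata: hyperparameters (hp), metrics, lineage (data, code)
  8. P8: Continuous monitoring
    1. Records serving performance
  9. P9: Feedback loops
    1. Multiple feedback loops that feed operational insights back into development
    2. A performance feedback loop that sets the criteria for retraining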
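
To make P2 more concrete, here is a minimal sketch of DAG-based orchestration using only the Python standard library (`graphlib`, Python 3.9+). It is not how Airflow or Kubeflow Pipelines work internally. The task names, the `PIPELINE` and `TASKS` dictionaries, and the `run_pipeline` helper are all hypothetical names made up for illustration. The point is only that the DAG fixes the execution order and decides which upstream outputs each step receives.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_data": set(),
    "build_features": {"extract_data"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model"},
    "register_model": {"evaluate_model"},
}

# Placeholder task bodies (assumptions): each receives a dict of upstream outputs.
TASKS = {
    "extract_data": lambda up: [1.0, 2.0, 3.0, 4.0],
    "build_features": lambda up: [x * 2 for x in up["extract_data"]],
    "train_model": lambda up: sum(up["build_features"]) / len(up["build_features"]),
    "evaluate_model": lambda up: {"mean_prediction": up["train_model"]},
    "register_model": lambda up: f"registered with {up['evaluate_model']}",
}


def run_pipeline(dag, tasks):
    """Run every task after its dependencies and pass upstream outputs downstream."""
    results = {}
    # static_order() yields tasks so that dependencies always come first;
    # it raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        upstream = {dep: results[dep] for dep in dag[name]}
        results[name] = tasks[name](upstream)
    return order, results


if __name__ == "__main__":
    order, results = run_pipeline(PIPELINE, TASKS)
    print(order)
    print(results["register_model"])
```

Real orchestrators add scheduling, retries, and distributed execution on top of this idea, but the "blueprint" role of the DAG is the same.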
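2. Technical components: the technical implementation of P1 to P9 as system components
  1. C1: CI/CD Component (P1, P6, P9)
    1. Handles build, test, delivery, and deploy
    2. Provides fast feedback loops
    3. e.g., Jenkins, GitHub Actions
    4. Automated error fixing, and storing training and inference source code, etc., in a shippable format (Python whl)
    5. Runs unit and integration tests
    6. The whole process is automated and uses resources that the CI/CD tool allocates dynamically, so it follows the same steps and produces the same result every time.
  2. C2: Source Code Repository (P4, P5)
    1. Provides commit and merge functionality for model training, inference, and application source code so that developers can collaborate
    2. GitHub, GitLab, Gitea, Bitbucket
  3. C3: Workflow Orchestration Component (P2, P3, P6)
    1. Orchestrates tasks according to DAGs
      1. Defines the execution order
      2. Defines how the artifacts of each step are used
    2. Runs the packaged code through a sequence of steps such as data extraction, training, inference, and model storage
    3. Apache Airflow, Kubeflow Pipelines, Watson Studio Pipelines, Luigi, AWS SageMaker, Azure Pipelines
    4. In theory it can also handle scheduling, but it is a component specialized for complex workflow orchestration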
  4. C4: Feature Store System (P3, P4)
    1. Offline feature store: a normal-latency store for experimentation and development
    2. Online store: a low-latency store for inference and prediction
    3. Google Feast, AWS Feature Store, Tecton.ai, Hopsworks.ai
    4. Highly dependent on the use case: cloud is preferred over on-premises to achieve workload scalability.
  5. C5: Model Training Infrastructure (P6)
    1. Scalable and distributed infrastructure for CPUs, RAM, and GPUs
    2. Control tools: Kubernetes, Red Hat OpenShift
  6. C6: Model Registry (P3, P4)
    1. Stores models together with their metadata
    2. MLflow, AWS SageMaker Model Registry, Azure ML Model Registry, Neptune.ai
  7. C7: ML Metadata Stores (P4, P7)
    1. Implements the P4 and P7 functionality directly
    2. Kubeflow Pipelines, AWS SageMaker Pipelines, Azure ML, IBM Watson Studio
    3. MLflow provides C6 and C7 together, and it is also open source, so it is widely used!
  8. C8: Model Serving Component (P1)
    1. Infrastructure and related settings differ by serving type (batch/real-time/serverless)
    2. Containerize the model with the Kubernetes + Docker combination, then serve it through something like a Flask API
    3. Azure ML REST API, AWS SageMaker Endpoints, IBM Watson Studio, Google Vertex AI prediction service
  9. C9: Monitoring Component (P8, P9)
    1. Continuous monitoring of model performance metrics
    2. Continuous monitoring of CI/CD, orchestration, infrastructure, etc.
    3. Prometheus, Grafana, ELK stack, TensorBoard, Kubeflow, MLflow, AWS SageMaker Model Monitor
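
As a rough illustration of what C6 and C7 store, the sketch below keeps model versions in memory together with their metadata (hyperparameters, metrics, and data/code lineage as content hashes). This is a toy stand-in and not the API of MLflow or any other tool listed above. `InMemoryRegistry`, `ModelRecord`, `fingerprint`, and every value in the demo are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(obj):
    """Short, stable SHA-256 digest of a JSON-serializable object (a stand-in for a data or code version id)."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class ModelRecord:
    name: str
    version: int
    params: dict     # hyperparameters (P7)
    metrics: dict    # evaluation metrics (P7)
    data_hash: str   # lineage: which data produced this model (P4)
    code_hash: str   # lineage: which code produced this model (P4)


@dataclass
class InMemoryRegistry:
    records: dict = field(default_factory=dict)

    def register(self, name, params, metrics, data, code):
        """Store a new version of `name`; version numbers start at 1 and increase by 1."""
        versions = self.records.setdefault(name, [])
        record = ModelRecord(
            name, len(versions) + 1, params, metrics,
            fingerprint(data), fingerprint(code),
        )
        versions.append(record)
        return record

    def latest(self, name):
        """Return the most recently registered version of `name`."""
        return self.records[name][-1]


if __name__ == "__main__":
    registry = InMemoryRegistry()
    # All values below are made up for the demo.
    registry.register("churn", {"lr": 0.1}, {"auc": 0.81}, data=[1, 2, 3], code="train-v1")
    registry.register("churn", {"lr": 0.05}, {"auc": 0.84}, data=[1, 2, 3], code="train-v1")
    print(registry.latest("churn"))
```

Because the data and code hashes are stored next to each version, two runs on the same data are easy to tell apart from runs where the data changed, which is what reproducibility (P3) relies on.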
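3. Roles
  1. R1: Business Stakeholder (PO/PM)
    1. Defines the business goal to be achieved with ML (ROI)
    2. Manages communication
  2. R2: Solutions Architect: designs the architecture and defines the required technologies
  3. R3: Data Scientist: translates the business problem into an ML problem; trains and selects models
  4. R4: Data Engineer
    1. Manages data pipelines
    2. Ensures data can be ingested into the feature store system
  5. R5: Software Engineer: reworks the ML problem from a software engineering perspective
  6. R6: DevOps Engineer
    1. Bridges development and operations
    2. Manages MLOps overall, including CI/CD, ML workflow orchestration, and monitoring
  7. R7: ML Engineer / MLOps Engineer
    1. Combines the skills of several roles to oversee MLOps as a whole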
Architecture And Workflow
Conceptualization
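1. In the paper's Venn diagram, the intersection (?) of machine learning, software engineering, and data engineering is the MLOps paradigm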
2. MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.
Open Challenges
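1. Organizational Challenges
  1. Challenges of mindset and culture
  2. A cultural shift is needed
    1. model-driven ML → product-oriented discipline
    2. data-related aspects > ML model building
  3. A problem that data scientists cannot handle alone
    1. Education needs to expand from model development alone to a functional curriculum that covers producing ML products
    2. Multidisciplinary teams are needed
2. ML System Challenges
  1. It is hard to estimate the right amount of infrastructure resources → cloud computing is expanding
3. Operational Challenges
  1. Diverse software and hardware environments make manual operation difficult → robust automation is needed
  2. New data arrives continuously → automated retraining is needed
  3. Because the system is complex, it is hard to pinpoint the root cause when something fails (often several areas are involved at once)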
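
The "automated retraining" item above, together with the performance feedback loop from P9, can be sketched as a simple monitor that signals retraining when recent accuracy drops. This is only one possible trigger and is not taken from the paper. The `RetrainTrigger` class, its `window` and `threshold` parameters, and the default values are assumptions for illustration.

```python
from collections import deque


class RetrainTrigger:
    """Signal retraining when accuracy over the last `window` predictions falls below `threshold`."""

    def __init__(self, window=100, threshold=0.8):
        # deque(maxlen=...) silently drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, prediction, actual):
        """Log whether one served prediction matched the ground truth that arrived later."""
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        """Rolling accuracy, or None if nothing has been recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        # Wait for a full window so a handful of early mistakes does not fire the trigger.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.threshold


if __name__ == "__main__":
    trigger = RetrainTrigger(window=4, threshold=0.75)
    for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
        trigger.record(prediction, actual)
    if trigger.should_retrain():
        print("accuracy dropped to", trigger.accuracy, "-> start the training pipeline")
```

In a real setup the trigger would kick off the orchestrated training pipeline rather than print a message, and the metric and threshold would depend on the use case.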
Conclusion
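1. Research background
  1. ML models keep pouring out, but projects very often stall at the PoC stage.
  2. Academia focuses excessively on model building and benchmarking and pays too little attention to real-world scenarios
  3. Data scientists still manage models manually
2. Research results
  1. Derives a holistic definition of MLOps across four areas: Principles, Components, Roles, and Architecture
  2. Provides a common understanding for both researchers and practitioners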
