ML code delivery to production in a few days: how we saved our company a lot of money with a couple of scripts by building easy-to-use ML automation with Kubernetes, Airflow and Jenkins.

3 min read · Sep 7, 2020

Overview

In our company, hundreds of scheduled machine learning (ML) models and calculation scripts run in production on a constant basis. Our Data Engineering division took on the task of building a CI/CD component that fit our goals:

A wide range of libraries that differs from project to project
Different execution platforms
Flexible scheduling
Shorter time to market (TTM)
Scalability and fault tolerance

To achieve these goals we selected Kubernetes, which we consider the best container orchestration system, Airflow as one of the fastest-growing schedulers, and Jenkins as one of the most popular continuous integration and continuous delivery (CI/CD) systems. For some optimization tasks we also use a private Docker registry of your choice (the best ones, I think, are Nexus and JFrog).

We went through two versions of our ML automation system. The first one containerized every application and kept an image for each version of each application. It had several disadvantages:

Storage. Each container image takes about 2 GB of disk space on average.
Time to market. During development, every test run needs extra time to rebuild the container.
Debugging speed

We settled on the second version, which is described below.

Concepts

DAG generation, runtime pod generation, and code execution directly from Git.

Advantage: it solves the disadvantages of the previous approach.

The system works in two parts:

DAG generation
DAG execution

Both parts use Jinja templates: one for the DAG and one for the Kubernetes pod structure.

DAG generation

DAG generation runs after every commit to Git and is triggered by a webhook. Jenkins receives the webhook and runs the DAG generation pipeline. This pipeline uses a free utility

https://github.com/SingularBunny/render-jinja-with-yaml

to render our Jinja template with the YAML config file.

YAML Configuration files

These files are placed in the project root directory in Git and tell the CI/CD system how the DAG should be generated and how the code should be handled. We developed two versions of the configuration file: single-step and multi-step.

Single-step configuration

Multi-step configuration

Steps can come from different repositories. Each step runs as an independent Kubernetes pod. Steps can also be external DAGs, which are converted to Airflow sensors. For each step you can define:

base image from a public or private repository
command to execute
Git repository, branch and revision with the code to execute
Kubernetes secrets, config maps, user, group, git-sync container user, fsGroup, capabilities
Airflow step retries and retry delay
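To make the list above more concrete, here is a minimal sketch of how one step could be modelled in Python after the YAML file has been parsed into a dictionary. All field and key names (`repository`, `config_maps`, `retry_delay_seconds` and so on) are hypothetical and are not taken from our real configuration files; capabilities and the git-sync container user are left out to keep it short.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Dict, List, Optional


@dataclass
class GitSource:
    """Where the code for a step comes from (field names are hypothetical)."""
    repository: str
    branch: str = "master"
    revision: str = "HEAD"


@dataclass
class Step:
    """One pipeline step; each step runs as its own Kubernetes pod."""
    name: str
    image: str      # base image from a public or private registry
    command: str    # command executed inside the pod
    git: GitSource
    secrets: List[str] = field(default_factory=list)
    config_maps: List[str] = field(default_factory=list)
    run_as_user: Optional[int] = None
    run_as_group: Optional[int] = None
    fs_group: Optional[int] = None
    retries: int = 0
    retry_delay: timedelta = timedelta(minutes=5)


def step_from_dict(raw: Dict) -> Step:
    """Build a Step from a plain dict, e.g. one entry of a parsed YAML file."""
    return Step(
        name=raw["name"],
        image=raw["image"],
        command=raw["command"],
        git=GitSource(**raw["git"]),
        secrets=raw.get("secrets", []),
        config_maps=raw.get("config_maps", []),
        run_as_user=raw.get("run_as_user"),
        run_as_group=raw.get("run_as_group"),
        fs_group=raw.get("fs_group"),
        retries=raw.get("retries", 0),
        retry_delay=timedelta(seconds=raw.get("retry_delay_seconds", 300)),
    )


# Illustrative input only: the repository URL and image are placeholders.
example = step_from_dict({
    "name": "train",
    "image": "python:3.8-slim",
    "command": "python train.py",
    "git": {"repository": "https://git.example.com/ml/train.git", "branch": "develop"},
    "retries": 2,
})
```

Keeping sensible defaults for everything except the name, image, command and repository keeps a single-step configuration down to a few lines.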

j2 DAG template

The Jinja DAG template is used to generate the DAG .py file. For extra features, I recommend the dynamic DAG generation approach described in the article Creating a dynamic DAG using Apache Airflow by Antony Henao.

Most of the template is common to many DAGs and looks like an ordinary Python script:

Default arguments are taken from the YAML configuration:
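As an illustration only, the mapping from configuration values to Airflow's `default_args` could look like the sketch below. The config keys (`owner`, `start_date`, `retries`, `retry_delay_minutes`) are assumptions made for this example; the dictionary keys on the output side are the standard ones Airflow reads from `default_args`.

```python
from datetime import datetime, timedelta


def build_default_args(config: dict) -> dict:
    """Map hypothetical config keys onto Airflow's default_args dictionary."""
    return {
        "owner": config.get("owner", "airflow"),
        # Date-only ISO strings are parsed as midnight of that day.
        "start_date": datetime.fromisoformat(config["start_date"]),
        "retries": int(config.get("retries", 0)),
        "retry_delay": timedelta(minutes=int(config.get("retry_delay_minutes", 5))),
    }


default_args = build_default_args({"start_date": "2020-09-01", "retries": 3})
```

In the real template these values are substituted by Jinja at generation time, so the resulting DAG file contains plain literals.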

The Jinja template for pod generation is usually included from another template file:

Jinja macros that generate operators:

Jinja macros that generate sensors on external DAGs:

All steps are generated with Jinja loops and macros:
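The idea behind that loop can be pictured in plain Python, used here as a stand-in for Jinja so the sketch runs without extra dependencies. It is not our actual template: the step keys (`name`, `external_dag_id`), the `<name>_command` variable convention, the choice of `BashOperator` and the strictly sequential chaining are all assumptions for illustration. `BashOperator` and `ExternalTaskSensor` themselves are standard Airflow classes.

```python
from typing import Dict, List


def render_tasks(steps: List[Dict]) -> str:
    """Emit one task definition per step, then chain the tasks in order."""
    lines = []
    for step in steps:
        name = step["name"]
        if "external_dag_id" in step:
            # Steps that point at another DAG become sensors.
            lines.append(
                f"{name} = ExternalTaskSensor(task_id={name!r}, "
                f"external_dag_id={step['external_dag_id']!r}, dag=dag)"
            )
        else:
            # Ordinary steps become operators that run the pod command.
            lines.append(
                f"{name} = BashOperator(task_id={name!r}, "
                f"bash_command={name}_command, dag=dag)"
            )
    # Chain tasks sequentially: first >> second >> third.
    names = [step["name"] for step in steps]
    if len(names) > 1:
        lines.append(" >> ".join(names))
    return "\n".join(lines)


generated = render_tasks([
    {"name": "wait_for_features", "external_dag_id": "feature_pipeline"},
    {"name": "train"},
    {"name": "score"},
])
print(generated)
```

In Jinja, the `if`/`else` branches correspond to the two macros and the `for` loop to a `{% for %}` block over the steps in the configuration.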

The Python script produced by rendering this template is shown in Appendix 1.

DAG execution

At this stage the DAG is executed by Airflow. Each operator generates its own pod using Airflow's built-in Jinja templating.

j2 pod generation command template

This template is common to all steps, and each step renders its own command from it. It looks like a typical kubectl run command with the --overrides flag.

It uses the public Kubernetes git-sync container to pull the code from the repository, and the container from the config to execute it. A more advanced command can be found in the Python script in Appendix 1, with its rendered result in Appendix 2.
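For readers who want a feel for the shape of such a command, here is a simplified sketch that builds it in Python instead of Jinja. It is an assumption-laden illustration, not our production template: the step keys, the `/code` mount path and the placeholder git-sync image are invented for the example, and the `GIT_SYNC_*` environment variable names follow git-sync v3 (later releases renamed them).

```python
import json
import shlex
from typing import Dict, List


def build_kubectl_run(step: Dict, git_sync_image: str) -> List[str]:
    """Return argv for `kubectl run` with a pod spec passed via --overrides."""
    mount = {"name": "code", "mountPath": "/code"}
    overrides = {
        "apiVersion": "v1",
        "spec": {
            "restartPolicy": "Never",
            # Scratch volume shared by the init container and the main one.
            "volumes": [{"name": "code", "emptyDir": {}}],
            # Init container clones the repository once, then exits.
            "initContainers": [{
                "name": "git-sync",
                "image": git_sync_image,
                "env": [
                    {"name": "GIT_SYNC_REPO", "value": step["repository"]},
                    {"name": "GIT_SYNC_BRANCH", "value": step["branch"]},
                    {"name": "GIT_SYNC_ROOT", "value": "/code"},
                    {"name": "GIT_SYNC_DEST", "value": "src"},
                    {"name": "GIT_SYNC_ONE_TIME", "value": "true"},
                ],
                "volumeMounts": [mount],
            }],
            # Main container runs the step command against the cloned code.
            "containers": [{
                "name": step["name"],
                "image": step["image"],
                "workingDir": "/code/src",
                "command": ["sh", "-c", step["command"]],
                "volumeMounts": [mount],
            }],
        },
    }
    return [
        "kubectl", "run", step["name"],
        "--image", step["image"],
        "--restart=Never",
        "--rm", "--attach",
        "--overrides", json.dumps(overrides),
    ]


argv = build_kubectl_run(
    {"name": "train", "image": "python:3.8-slim", "command": "python train.py",
     "repository": "https://git.example.com/ml/train.git", "branch": "master"},
    git_sync_image="registry.example.com/git-sync:v3",  # placeholder image
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Running git-sync as an init container means the code is in place before the main container starts, so the image itself never has to be rebuilt when the code changes.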

Appendix 1. Generated DAG example

Appendix 2. Rendered command example