Streaming ML CI/CD in a few days: how we extended our easy-to-use code delivery system to Spark Streaming applications and ML

Overview

A significant trend in the IT business nowadays is the readiness to work with hot data, whose lifetime from the moment it appears can be less than a second.

Let’s say you come to a store and want to take out a loan to buy a phone. You want to get the loan on favorable terms, and the bank wants to lend to a verified client. The time window in which you need the credit money is relatively short. Here is an example from the telecom domain: you have run out of money just as your wife is going into labor and you need to call an ambulance. A trust payment from the operator would be very helpful.

The value of data to the business in such situations is extremely high. This state of the market has led to the emergence of the Kappa architecture and streaming data processing, which should guarantee minimal latency.

Recently I wrote about our code delivery system based on Kubernetes, Airflow and Jenkins.

Among the advantages of our system:

  1. Extremely low development costs
  2. Solving all required code delivery tasks
  3. Single point of product management
  4. A single point for fault tolerance and monitoring

For streaming support we want to keep the same automation approach and the same point of control as for batch jobs, while minimizing the amount of resources used.

Concepts

We decided that the automation of streaming applications should not differ from the automation of batch applications. In addition, we want to build the Scala application “on the fly” from the repository.

The fact is that a Spark Streaming application keeps the console blocked after spark-submit is executed. The option

spark.yarn.submit.waitAppCompletion=false

makes spark-submit behave exactly like a batch job submission: the command returns as soon as the application is accepted by YARN. Accordingly, after such an application is submitted, the Kubernetes Pod from which the submission was made terminates and frees up its resources.
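
As an illustration, here is a minimal sketch of a Pod-based submit task built with the Airflow KubernetesPodOperator; the DAG name, namespace, image and artifact names are assumptions, not our production code:

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG("streaming_submit_sketch", start_date=datetime(2020, 1, 1), schedule_interval=None)

# Hypothetical submit step: the Pod runs spark-submit with
# waitAppCompletion=false, so it exits right after YARN accepts the job
# and the Pod's resources are released.
submit_streaming_app = KubernetesPodOperator(
    task_id="submit_international_roaming_app",
    name="submit-international-roaming-app",
    namespace="airflow",                       # assumed namespace
    image="our-registry/spark-submit:latest",  # assumed image
    cmds=["spark-submit"],
    arguments=[
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        "international-roaming-app.jar",       # assumed artifact
    ],
    is_delete_operator_pod=True,               # clean up the Pod afterwards
    dag=dag,
)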

Next, we have to figure out how to tell that the job is running. This is quite simple to do through the YARN CLI. This bash command does it:

while yarn application -list 2>/dev/null | grep InternationalRoamingApp | grep -e RUNNING -e ACCEPTED > /dev/null; do sleep 10; done && exit 1

If the application fails, we also need to rerun the spark-submit steps to restart it.

DAG generation remastering

Now that we’ve figured out how Spark Streaming jobs should be handled, we just need to add SSH operator generation support to our automation system, which will allow us to check the job state.

New structures of single-step and multi-step files were introduced. The rework allows declaring a command for remote execution (in our case, on a Hadoop cluster) and additional containers for the Pod that can mix in extra logic (in our case, application building).

Single-step configuration file
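
A minimal sketch of what such a single-step file might look like, written here as a Python structure; every field name and value is an assumption made for illustration:

# Hypothetical single-step configuration: one Pod builds the Scala project
# in a sidecar container and submits the streaming application to YARN.
single_step_config = {
    "dag_id": "international_roaming_stream",
    "schedule_interval": None,  # the stream is started once, not on a schedule
    "command": (
        "spark-submit --master yarn --deploy-mode cluster "
        "--conf spark.yarn.submit.waitAppCompletion=false "
        "international-roaming-app.jar"
    ),
    "sidecars": [
        {
            "name": "scala-builder",
            "image": "our-registry/scala-sbt:latest",  # assumed builder image
            "command": ["sbt", "assembly"],
        },
    ],
}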

Multi-step configuration
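
Similarly, a hypothetical multi-step file could declare the submit step plus the remote monitoring step; again, all names and values are assumptions:

# Hypothetical multi-step configuration: the first step builds and submits
# the application, the second monitors it over SSH and fails when it stops.
multi_step_config = {
    "dag_id": "international_roaming_stream",
    "steps": [
        {
            "name": "build_and_submit",
            "command": (
                "spark-submit --master yarn --deploy-mode cluster "
                "--conf spark.yarn.submit.waitAppCompletion=false "
                "international-roaming-app.jar"
            ),
            "sidecars": [
                {"name": "scala-builder", "image": "our-registry/scala-sbt:latest"},
            ],
        },
        {
            "name": "monitor",
            "remote_command": (
                "while yarn application -list 2>/dev/null "
                "| grep InternationalRoamingApp "
                "| grep -e RUNNING -e ACCEPTED > /dev/null; "
                "do sleep 10; done && exit 1"
            ),
            "retries": 10,              # how many times to restart the stream
            "retry_delay_minutes": 5,   # restart interval
        },
    ],
}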

A method that allows restarting all steps before the failed step:
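
One way to sketch such a method is an Airflow retry callback that resets the upstream task instances; the function name and session handling below are illustrative assumptions, not our exact implementation:

from airflow.models import TaskInstance
from airflow.utils.db import provide_session
from airflow.utils.state import State

@provide_session
def restart_upstream_steps(context, session=None):
    """Illustrative retry callback: when the monitoring task fails, reset its
    upstream task instances so the build and submit steps run again."""
    ti = context["task_instance"]
    execution_date = context["dag_run"].execution_date
    upstream_ids = list(context["task"].upstream_task_ids)

    upstream_tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == ti.dag_id,
            TaskInstance.execution_date == execution_date,
            TaskInstance.task_id.in_(upstream_ids),
        )
        .all()
    )
    for upstream_ti in upstream_tis:
        upstream_ti.state = State.NONE  # let the scheduler pick them up again
    session.commit()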

Conversion of commands for internal usage in Python:
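
A small helper of this kind could, for example, turn a shell command from the configuration into a Python string literal that is safe to embed in the generated DAG; the name and escaping rules are assumptions:

def to_python_literal(raw_command):
    """Hypothetical helper: escape a shell command taken from the step
    configuration so it can be embedded in the rendered DAG file."""
    escaped = raw_command.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"'.format(escaped)

# to_python_literal('grep -e RUNNING -e ACCEPTED')
# -> '"grep -e RUNNING -e ACCEPTED"'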

Loops for sidecar containers:
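
A sketch of what such a loop might look like as a Jinja2 fragment of the DAG template; the variable names are assumptions:

from jinja2 import Template

# Hypothetical template fragment: for every sidecar declared in the step
# configuration, render an extra container spec for the submitting Pod.
SIDECARS_TEMPLATE = Template("""
sidecar_containers = [
{%- for sidecar in step.sidecars %}
    {
        "name": "{{ sidecar.name }}",
        "image": "{{ sidecar.image }}",
        "command": {{ sidecar.command }},
    },
{%- endfor %}
]
""")

print(SIDECARS_TEMPLATE.render(step={
    "sidecars": [
        {"name": "scala-builder",
         "image": "our-registry/scala-sbt:latest",
         "command": ["sbt", "assembly"]},
    ],
}))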

Macros for SSH Operator that executes a remote command:
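
For example, such a macro could render an SSHOperator definition from a step description; the connection id and field names are assumptions:

from jinja2 import Template

# Hypothetical macro: renders the SSHOperator that executes a remote
# command on the Hadoop edge node.
SSH_OPERATOR_MACRO = Template("""
{%- macro ssh_step(step) -%}
{{ step.name }} = SSHOperator(
    task_id="{{ step.name }}",
    ssh_conn_id="hadoop_edge_node",
    command={{ step.remote_command | tojson }},
    dag=dag,
)
{%- endmacro -%}
{{ ssh_step(step) }}
""")

print(SSH_OPERATOR_MACRO.render(step={
    "name": "monitor_international_roaming_app",
    "remote_command": "while yarn application -list | grep InternationalRoamingApp | grep -e RUNNING -e ACCEPTED > /dev/null; do sleep 10; done && exit 1",
}))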

You can see the Python script that results from rendering this template in Appendix 1.

DAG execution remastering

At the stage of DAG execution, the first step is a spark-submit task that runs a Pod, which completes the assembly and the submission. After that, the usual SSH operator starts, which monitors the launched application. If the application crashes, the SSH operator's command fails, and the flow restarts. We can control the number of restarts and the restart interval of our streaming application.
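
Continuing the hypothetical tasks sketched above, the resulting wiring could look like this; the number of restarts and the restart interval map onto the usual Airflow retries and retry_delay parameters:

from datetime import timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator

# The YARN polling command from the previous section: it fails as soon as
# the streaming application leaves the RUNNING/ACCEPTED state.
MONITOR_COMMAND = (
    "while yarn application -list 2>/dev/null "
    "| grep InternationalRoamingApp "
    "| grep -e RUNNING -e ACCEPTED > /dev/null; "
    "do sleep 10; done && exit 1"
)

monitor_streaming_app = SSHOperator(
    task_id="monitor_international_roaming_app",
    ssh_conn_id="hadoop_edge_node",            # assumed connection id
    command=MONITOR_COMMAND,
    retries=10,                                # number of restarts
    retry_delay=timedelta(minutes=5),          # restart interval
    on_retry_callback=restart_upstream_steps,  # re-run the submit steps first
    dag=dag,
)

# The submit Pod runs first, then the SSH operator watches the application.
submit_streaming_app >> monitor_streaming_app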

The command allows adding sidecar containers. In our case, we use one to build Scala projects.

The rendering result of this command can be found in Appendix 2.

ML Model composition in Spark Streaming applications

Spark allows you to use models as:

  1. A UDF
  2. A Transformer or Estimator inside a Spark ML pipeline

Our system allows splitting the model from the data preparation logic, using models as separate microservices, and composing them in many different ways.
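
As a minimal, self-contained sketch of the second option (the toy data and model below are assumptions, not our production pipeline), a Spark ML pipeline is fitted on a static DataFrame and then applied, unchanged, inside a streaming query:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("streaming-ml-sketch").getOrCreate()

# Fit a toy pipeline on static data.
train = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="y"),
])
model = pipeline.fit(train)

# The built-in "rate" source is used so the example needs no external
# infrastructure; its `value` column is reused as the single feature.
stream = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
          .selectExpr("CAST(value AS DOUBLE) AS x"))

# The fitted PipelineModel is a Transformer, so it plugs straight into the
# streaming DataFrame.
query = (model.transform(stream)
         .select("x", "prediction")
         .writeStream.format("console").outputMode("append")
         .start())
query.awaitTermination()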

Appendix 1. DAG generation example

Appendix 2. Rendered command example
