Apache Spark Unit Testing Part 3 — Streaming

Third part of article series is about Spark Streaming unit testing. This series pretend to fill the gap between code and documentation inside Spark Unit Testing domain. Spark has a huge Framework that allow to developers to test their code in any various cases.

There is one package spark streaming. Tests are placed here.

Dependencies

Streaming Unit Testing

TestSuiteBase

This is the base trait for Spark Streaming test suites. This provides basic functionality to run user-defined set of input on user-defined stream operations, and verify the output. Extends SparkFunSuite and Logging.

Additional service classes:

This is a input stream just for the test suites. This is equivalent to a checkpointable, replayable, reliable message queue like Kafka. It requires a sequence as input, and returns the ith element at the ith batch under manual clock.

This is a output stream just for the testsuites. All the output is collected into a ConcurrentLinkedQueue. This queue is wiped clean on being restored from checkpoint. The buffer contains a sequence of RDD’s, each containing a sequence of items.

This is a output stream just for the test suites. All the output is collected into ConcurrentLinkedQueue. This queue is wiped clean on being restored from checkpoint. The queue contains a sequence of RDD’s, each containing a sequence of partitions, each containing a sequence of items.

An object that counts the number of started / completed batches. This is implemented using a StreamingListener. Constructing a new instance automatically registers a StreamingListener on the given StreamingContext.

LocalStreamingContext

Manages a local ssc StreamingContext variable, correctly stopping it after each test. Note that it also stops active SparkContext if stopSparkContext is set to true (default). In most cases you may want to leave it, to isolate environment for SparkContext in each test.

I believe that science makes the World better. Big Data, Machine Learning, Quantum Computing.