Apache Spark Integration Testing

Eugene Lopatkin
Sep 27, 2020


Testing with HDFS

I think the best way to approach testing a Spark application is to start with integration tests. There are plenty of internal and external tools for building this kind of test; I prefer the internal testing tools of the frameworks I already use over external ones. Hadoop developers have prepared a whole family of "mini" components that let you start your own cluster inside your tests and verify integration aspects.

The most common case in the context of Apache Spark integration testing is using a real HDFS instead of the local file system.

MiniDFSCluster

This class creates a single-process DFS cluster for JUnit testing. The data directories for a non-simulated DFS are under the testing directory. For simulated data nodes, no underlying FS storage is used.

Dependencies:
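A minimal sbt setup might look like the following sketch; the `hadoop-minicluster` artifact bundles MiniDFSCluster and its runtime dependencies. The versions here are assumptions — align them with the Hadoop version your Spark build ships with:

```scala
// build.sbt (sketch; versions are assumptions)
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"          % "3.0.1" % Provided,
  "org.apache.hadoop" %  "hadoop-minicluster" % "2.7.4" % Test
)
```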

Example:
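A minimal sketch of a test that spins up a MiniDFSCluster and points Spark at it (the temp-directory handling and the assertion are illustrative assumptions):

```scala
import java.nio.file.Files

import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.apache.spark.sql.SparkSession

object MiniDfsExample extends App {
  // Keep the cluster's data directories inside a throwaway temp dir
  val baseDir = Files.createTempDirectory("minidfs").toFile
  val conf = new HdfsConfiguration()
  conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)

  // Single-process HDFS with one DataNode
  val cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
  cluster.waitClusterUp()
  val hdfsUri = s"hdfs://localhost:${cluster.getNameNodePort}"

  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  try {
    // Write to and read back from the real (mini) HDFS
    spark.range(10).write.parquet(s"$hdfsUri/tmp/numbers")
    assert(spark.read.parquet(s"$hdfsUri/tmp/numbers").count() == 10)
  } finally {
    spark.stop()
    cluster.shutdown()
  }
}
```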

Testing with Kafka

The Spark repository contains some utilities for testing Kafka interaction. They allow you to use a real Kafka broker in your integration tests.

Dependencies:
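A sketch of the sbt dependencies. It assumes a `tests`-classifier artifact is published for your Spark version, and that the embedded broker needs the full Kafka server rather than just `kafka-clients`; the exact module that ships `KafkaTestUtils` has moved between Spark releases, so check yours:

```scala
// build.sbt (sketch; versions and the tests classifier are assumptions)
val sparkVersion = "3.0.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10"       % sparkVersion % Test,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion % Test,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion % Test classifier "tests",
  // the embedded broker needs the full Kafka server, not just kafka-clients
  "org.apache.kafka" %% "kafka" % "2.4.1" % Test
)
```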

KafkaTestUtils

But there is one catch: this class is private to its package. Let's make it public:
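Since `KafkaTestUtils` is declared `private[kafka010]`, one way to reach it is to declare a public subclass inside the same package in your test sources. A sketch (the package must match where the class lives in your Spark version):

```scala
// src/test/scala/org/apache/spark/streaming/kafka010/PublicKafkaTestUtils.scala
package org.apache.spark.streaming.kafka010

// KafkaTestUtils is private[kafka010]; a public subclass declared in the
// same package re-exports it under an accessible name.
class PublicKafkaTestUtils extends KafkaTestUtils
```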

Now we can use these utilities through our class PublicKafkaTestUtils.

Example:
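A sketch of a test against the embedded broker. Method names such as `setup`, `createTopic`, `sendMessages`, `brokerAddress` and `teardown` follow the `KafkaTestUtils` API in recent Spark versions, but verify them against yours:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.PublicKafkaTestUtils

object KafkaExample extends App {
  val kafkaTestUtils = new PublicKafkaTestUtils()
  kafkaTestUtils.setup() // starts an embedded ZooKeeper and a Kafka broker

  val topic = "test-topic"
  kafkaTestUtils.createTopic(topic)
  kafkaTestUtils.sendMessages(topic, Array("a", "b", "c"))

  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  try {
    // Read from the real embedded broker through the regular Kafka source
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaTestUtils.brokerAddress)
      .option("subscribe", topic)
      .load()
    assert(df.count() == 3)
  } finally {
    spark.stop()
    kafkaTestUtils.teardown()
  }
}
```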

Testing with Hive

One of the most important features of Spark SQL is support for reading and writing data stored in Apache Hive. This support is native, and developers usually don't run a separate Hive service to test possible integration problems. One of the most frequent sources of trouble in Spark SQL is interaction with different versions of the Hive metastore, which is what enables Spark SQL to access the metadata of Hive tables. Since version 1.4.0, Spark has the property spark.sql.hive.metastore.version. According to the latest official documentation:

Available options are `0.12.0` through `2.3.7` and `3.0.0` through `3.1.2`.

This imposes certain limitations on combining particular Spark releases with particular Hive versions. One of the most common workarounds is the HiveWarehouseConnector. But this is precisely the reason you should test your applications.

Dependencies:

There are no prebuilt packages in the Maven repository because the Hive developers don't publish test-classified artifacts. You could copy and paste the necessary files from the Hive repository, or try Maven Search and use some third-party artifacts.
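Assuming you copy `MiniHS2` and its helpers from the Hive repository into your test sources, the remaining dependencies might look like this sketch (versions are assumptions and should match your `spark.sql.hive.metastore.version`):

```scala
// build.sbt (sketch; MiniHS2 itself is copied from the Hive repository)
val hiveVersion = "2.3.7"
libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-jdbc"    % hiveVersion % Test,
  "org.apache.hive" % "hive-service" % hiveVersion % Test
)
```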

MiniHS2

This class allows you to start your own Hive server and use it to check results or compatibility.

Possible mini cluster types: MR, TEZ, LLAP, LOCALFS_ONLY.

Possible transport modes: binary, http, all.

Example:
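A sketch, assuming MiniHS2 has been copied from the Hive repository as described above. The builder method `withConf` and the `start`/`getJdbcURL`/`stop` calls follow the Hive test code, but verify them against the version you copied:

```scala
import java.util.{HashMap => JHashMap}

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hive.jdbc.miniHS2.MiniHS2

object MiniHS2Example extends App {
  // LOCALFS_ONLY by default; use e.g. withMiniMR() for other cluster types
  val miniHS2 = new MiniHS2.Builder().withConf(new HiveConf()).build()
  miniHS2.start(new JHashMap[String, String]())

  // Plain JDBC against the embedded HiveServer2
  val conn = java.sql.DriverManager
    .getConnection(miniHS2.getJdbcURL(), System.getProperty("user.name"), "")
  try {
    val stmt = conn.createStatement()
    stmt.execute("CREATE TABLE src (id INT)")
    assert(stmt.executeQuery("SHOW TABLES").next())
  } finally {
    conn.close()
    miniHS2.stop()
  }
}
```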

Testing with JDBC

Spark SQL also includes a data source that can read data from other databases using JDBC.

And of course, Spark can read from any database for which a JDBC driver exists. For Spark integration testing with JDBC, I prefer to use the Testcontainers (Java) library. Scala and Python implementations of this library also exist.

Testcontainers is a Java library that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container.

Dependencies:
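A sketch of the sbt dependencies for the Scala wrapper and a PostgreSQL container (versions are assumptions):

```scala
// build.sbt (sketch; versions are assumptions)
val testcontainersScalaVersion = "0.38.4"
libraryDependencies ++= Seq(
  "com.dimafeng"   %% "testcontainers-scala-scalatest"  % testcontainersScalaVersion % Test,
  "com.dimafeng"   %% "testcontainers-scala-postgresql" % testcontainersScalaVersion % Test,
  "org.postgresql" %  "postgresql"                      % "42.2.16"                  % Test
)
```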

You can check the list of all prebuilt containers for Scala here.

Example:
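A sketch of reading from a throwaway PostgreSQL container through Spark's JDBC source. The table name `my_table` is a hypothetical example, created and populated via plain JDBC so the read has something to return:

```scala
import com.dimafeng.testcontainers.PostgreSQLContainer
import org.apache.spark.sql.SparkSession

object JdbcExample extends App {
  val container = PostgreSQLContainer()
  container.start() // pulls and starts a disposable PostgreSQL in Docker

  // Prepare a hypothetical table to read (plain JDBC)
  val conn = java.sql.DriverManager
    .getConnection(container.jdbcUrl, container.username, container.password)
  val stmt = conn.createStatement()
  stmt.execute("CREATE TABLE my_table (id INT)")
  stmt.execute("INSERT INTO my_table VALUES (1), (2)")
  conn.close()

  val spark = SparkSession.builder().master("local[2]").getOrCreate()
  try {
    // Point Spark's JDBC source at the container
    val df = spark.read
      .format("jdbc")
      .option("url", container.jdbcUrl)
      .option("dbtable", "my_table")
      .option("user", container.username)
      .option("password", container.password)
      .option("driver", "org.postgresql.Driver")
      .load()
    assert(df.count() == 2)
  } finally {
    spark.stop()
    container.stop()
  }
}
```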
