Apache Spark Integration Testing

Testing with HDFS

In my opinion, the best way to start testing your Spark application is to prepare integration tests. There are many internal and external tools for building such tests. I prefer the internal testing tools of the frameworks already in use over external testing tools. The Hadoop developers provide a wide range of "mini" components that let you start your own cluster inside your tests and check the integration aspects:

The most common case in Apache Spark integration testing is using a real HDFS instead of the local file system.

This class (Hadoop's MiniDFSCluster) creates a single-process DFS cluster for JUnit testing. The data directories for non-simulated DFS are under the testing directory. For simulated data nodes, no underlying FS storage is used.

Dependencies:
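The mini cluster ships in Hadoop's test-classified artifacts; a minimal sbt sketch might look like this (the version is illustrative — align it with the Hadoop version on your cluster):

```scala
// build.sbt — test-scope dependencies for the Hadoop mini cluster
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-minicluster" % "3.2.1" % Test,
  "org.apache.hadoop" % "hadoop-hdfs"        % "3.2.1" % Test classifier "tests",
  "org.apache.hadoop" % "hadoop-common"      % "3.2.1" % Test classifier "tests"
)
```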

Example:
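A minimal sketch of the idea: start an in-process HDFS, point a local Spark session at it, write a DataFrame and read it back. The app name and output path are arbitrary; the MiniDFSCluster calls follow Hadoop's test API.

```scala
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.MiniDFSCluster
import org.apache.spark.sql.SparkSession

object MiniDfsExample {
  def main(args: Array[String]): Unit = {
    // Keep the mini cluster's data directories under a throwaway temp dir
    val baseDir = Files.createTempDirectory("minidfs").toFile
    val conf = new Configuration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)

    val cluster = new MiniDFSCluster.Builder(conf).build()
    cluster.waitClusterUp()
    val hdfsUri = s"hdfs://localhost:${cluster.getNameNodePort}/"

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("hdfs-integration-test")
      .getOrCreate()
    try {
      import spark.implicits._
      // Write to the mini HDFS instead of the local file system
      val out = hdfsUri + "tmp/numbers"
      Seq(1, 2, 3).toDF("n").write.parquet(out)
      assert(spark.read.parquet(out).count() == 3)
    } finally {
      spark.stop()
      cluster.shutdown()
    }
  }
}
```

In a real test suite you would start the cluster once per suite (e.g. in a `BeforeAll` hook) rather than per test, since startup takes a few seconds.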

Testing with Kafka

The Spark repository contains some utilities for testing Kafka interaction. They allow you to use a real Kafka broker in your integration tests.

Dependencies:
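The test utilities live in Spark's Kafka connector modules; a sketch of the sbt coordinates (versions are illustrative, and the availability of the test-classified jar may vary by Spark release):

```scala
// build.sbt — Spark's Kafka connector plus its test-classified jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10"       % "3.1.2" % Test,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.1.2" % Test classifier "tests"
)
```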

KafkaTestUtils

But there is one catch: this class is package-private. Let's expose it:
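Since the class is private to its package, the trick is to put a public subclass into the same package in your own test sources. A sketch, assuming the `kafka010` streaming module (adjust the package to the module you actually use):

```scala
// src/test/scala/org/apache/spark/streaming/kafka010/PublicKafkaTestUtils.scala
// Declaring our class in Spark's own package grants access to the
// package-private KafkaTestUtils.
package org.apache.spark.streaming.kafka010

class PublicKafkaTestUtils extends KafkaTestUtils
```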

Now we can use these utilities through our own class, PublicKafkaTestUtils.

Example:
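A sketch of a test flow with the wrapper: start the embedded ZooKeeper and Kafka, produce a few messages, and read them back through Spark's Kafka source. Method names follow Spark's KafkaTestUtils test sources and may differ slightly between versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.PublicKafkaTestUtils

object KafkaIntegrationExample {
  def main(args: Array[String]): Unit = {
    val kafka = new PublicKafkaTestUtils
    kafka.setup()                       // starts embedded ZooKeeper + Kafka broker
    val topic = "events"
    kafka.createTopic(topic)
    kafka.sendMessages(topic, Array("a", "b", "c"))

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("kafka-integration-test")
      .getOrCreate()
    try {
      val df = spark.read
        .format("kafka")
        .option("kafka.bootstrap.servers", kafka.brokerAddress)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
      assert(df.count() == 3)
    } finally {
      spark.stop()
      kafka.teardown()                  // stops the broker and ZooKeeper
    }
  }
}
```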

Testing with Hive

One of the most important features of Spark SQL is its support for reading and writing data stored in Apache Hive. This support is native, and developers usually don't use a separate Hive service to test possible integration problems. One of the most frequent sources of trouble in Spark SQL is the interaction with different versions of the Hive metastore, which enables Spark SQL to access the metadata of Hive tables. Since version 1.4.0, Spark has the property spark.sql.hive.metastore.version. According to the latest official documentation:

Available options are `0.12.0` through `2.3.7` and `3.0.0` through `3.1.2`.

This imposes certain limitations on combining particular Spark versions with particular Hive versions. A common decision is to use the HiveWarehouseConnector. And that is exactly why you should test your applications.
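For reference, pointing Spark SQL at a specific metastore version is just session configuration; the version values here are illustrative and must match your metastore:

```scala
import org.apache.spark.sql.SparkSession

// Configure Spark SQL against an external Hive metastore version
val spark = SparkSession.builder()
  .appName("hive-metastore-compat")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "maven") // fetch matching Hive jars
  .getOrCreate()
```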

Dependencies:

There are no prebuilt packages in the Maven repository, because the Hive developers don't publish test-classified artifacts. You could copy and paste the necessary files from the Hive repository, or try Maven Search and use some third-party artifacts.

This class allows you to start your own Hive server and use it to check results or compatibility.

Possible Mini Cluster types: MR, TEZ, LLAP, LOCALFS_ONLY.

Possible transport modes: binary, http, all.

Example:
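A rough sketch with Hive's MiniHS2 test class: start an in-process HiveServer2 and talk to it over JDBC. The builder and method names follow Hive's test sources and may differ between Hive versions, so treat this as an outline rather than a drop-in implementation.

```scala
import java.sql.DriverManager

import scala.collection.JavaConverters._

import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hive.jdbc.miniHS2.MiniHS2

object MiniHS2Example {
  def main(args: Array[String]): Unit = {
    // LOCALFS_ONLY is the default cluster type; others (MR, TEZ, LLAP)
    // are selected through the builder
    val miniHS2 = new MiniHS2.Builder().withConf(new HiveConf()).build()
    miniHS2.start(Map.empty[String, String].asJava)
    try {
      val conn = DriverManager.getConnection(miniHS2.getJdbcURL, "user", "")
      val rs = conn.createStatement().executeQuery("SELECT 1")
      rs.next()
      assert(rs.getInt(1) == 1)
      conn.close()
    } finally {
      miniHS2.stop()
    }
  }
}
```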

Testing with JDBC

Spark SQL also includes a data source that can read data from other databases using JDBC.

And of course, Spark can read from any database for which a JDBC driver exists. For Spark integration testing with JDBC, I prefer to use the Testcontainers (Java) library. Scala and Python implementations of this library also exist.

Testcontainers is a Java library that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container.

Dependencies:
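A sketch of the sbt coordinates for testcontainers-scala with the PostgreSQL module (versions are illustrative; pick the latest releases):

```scala
// build.sbt — testcontainers-scala plus a JDBC driver for the container's DB
libraryDependencies ++= Seq(
  "com.dimafeng" %% "testcontainers-scala-scalatest"  % "0.40.12" % Test,
  "com.dimafeng" %% "testcontainers-scala-postgresql" % "0.40.12" % Test,
  "org.postgresql" % "postgresql"                     % "42.5.0"  % Test
)
```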

A list of all prebuilt containers for Scala can be found here.

Example:
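A sketch of the flow (Docker must be running): start a throwaway PostgreSQL container, seed a table over plain JDBC, then read it back through Spark's JDBC data source. The table name and data are arbitrary.

```scala
import java.sql.DriverManager

import com.dimafeng.testcontainers.PostgreSQLContainer
import org.apache.spark.sql.SparkSession

object JdbcIntegrationExample {
  def main(args: Array[String]): Unit = {
    val container = PostgreSQLContainer()
    container.start()

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("jdbc-integration-test")
      .getOrCreate()
    try {
      // Seed a table through plain JDBC first
      val conn = DriverManager.getConnection(
        container.jdbcUrl, container.username, container.password)
      conn.createStatement().execute(
        "CREATE TABLE users (id INT, name TEXT)")
      conn.createStatement().execute(
        "INSERT INTO users VALUES (1, 'a'), (2, 'b')")
      conn.close()

      // Read the same table with Spark's JDBC source
      val df = spark.read
        .format("jdbc")
        .option("url", container.jdbcUrl)
        .option("dbtable", "users")
        .option("user", container.username)
        .option("password", container.password)
        .load()
      assert(df.count() == 2)
    } finally {
      spark.stop()
      container.stop()                  // throwaway instance is discarded
    }
  }
}
```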

--

Eugene Lopatkin

I believe that science makes the World better. Big Data, Machine Learning, Quantum Computing.