Apache Spark Unit Testing Part 2 — Spark SQL

Eugene Lopatkin
6 min read · Nov 20, 2019


This is the second part of an article series about how to use Spark's own test classes for unit testing. The series aims to fill the gap between code and documentation in the Spark unit-testing domain. Spark ships with an extensive test framework that allows developers to test their code in a wide variety of cases.
The Spark SQL package has four sub-projects, each of which has its own test classes:

In the context of testing your own Spark jobs, we will discuss only three of them (core, catalyst, hive).

Dependencies
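
To get access to these classes in your own project, Spark's test artifacts can be pulled in via the "tests" classifier. A possible sbt setup (the version number is illustrative; adjust to the Spark version you build against):

```scala
// build.sbt — pull in Spark's published test-jars alongside the regular artifacts
val sparkVersion = "2.4.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"     % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-sql"      % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-catalyst" % sparkVersion % Test classifier "tests",
  "org.apache.spark" %% "spark-hive"     % sparkVersion % Test classifier "tests"
)
```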

1.1 Spark SQL Execution Unit Testing

SharedSparkSession

Suites extending SharedSparkSession share resources (e.g. the SparkSession) across their tests. The trait initializes the Spark session in its beforeAll() implementation, before the automatic thread snapshot is performed, so the audit code could fail to report threads leaked by that shared session. The behavior is overridden here to take the snapshot before the Spark session is initialized. Extends SQLTestUtils and SharedSparkSessionBase.
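
A minimal suite sketch using the shared session (the suite name is hypothetical; `spark` and `testImplicits` are provided by the trait):

```scala
import org.apache.spark.sql.test.SharedSparkSession

// SharedSparkSession already mixes in SparkFunSuite via SQLTestUtils,
// so `test(...)` is available directly.
class MySharedSessionSuite extends SharedSparkSession {
  import testImplicits._

  test("filter keeps only even numbers") {
    val df = Seq(1, 2, 3, 4).toDF("n").filter($"n" % 2 === 0)
    assert(df.collect().map(_.getInt(0)).toSeq == Seq(2, 4))
  }
}
```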

SharedSparkSessionBase

Helper trait for SQL test suites where all tests share a single TestSparkSession.

TestSparkSession

A special SparkSession prepared for testing.

SQLTestUtils

Helper trait that should be extended by all SQL test suites within the Spark code base. It allows subclasses to plug in a custom SQLContext. It comes with test data prepared in advance, as well as all the implicit conversions used extensively by DataFrames. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SparkFunSuite, SQLTestUtilsBase and PlanTestBase.

SQLTestUtilsBase

Helper trait that can be extended by all external SQL test suites. It allows subclasses to plug in a custom SQLContext. To use the implicit methods, import testImplicits._ instead of going through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SQLTestData and PlanTestBase. Contains:

testImplicits, a helper object for importing SQL implicits. Note that the alternative of importing spark.implicits._ is not possible here, because the SQLContext is created immediately before the first test is run, while the implicits import is needed in the constructor.

SQLTestData

A collection of sample data used in SQL tests.

QueryTest

A great framework for checking results inside the SQL package. It contains a large number of DataFrame and Dataset assertions and checks.
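
A sketch of the typical usage (the suite name is hypothetical): checkAnswer compares a DataFrame against a set of expected rows, independent of row order.

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class MyQuerySuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("sum per key") {
    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
    // Fails with a readable diff if the actual rows differ from the expected ones.
    checkAnswer(df.groupBy("k").sum("v"), Seq(Row("a", 3), Row("b", 3)))
  }
}
```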

LocalSparkSession

Manages a local spark SparkSession variable, correctly stopping it after each test.

Example:
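
A sketch of how the trait is used (suite name hypothetical): each test assigns its own session to the `spark` var, and the trait stops it after the test.

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.{LocalSparkSession, SparkSession}

class MyLocalSessionSuite extends SparkFunSuite with LocalSparkSession {
  test("word count with a throwaway session") {
    // No manual cleanup needed: LocalSparkSession stops this session after the test.
    spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    val counts = spark.sparkContext.parallelize(Seq("a", "b", "a")).countByValue()
    assert(counts("a") == 2)
  }
}
```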

Metrics Testing

The base for statistics test cases that we want to include in both the hive module (for verifying behavior when using the Hive external catalog) as well as in the sql/core module.

ColumnarTestUtils

An object with useful methods for columnar-based test cases.

Example:
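
A sketch of generating random test data for given column types (method names as found in Spark's test sources):

```scala
import org.apache.spark.sql.execution.columnar.{ColumnarTestUtils, INT, STRING}

// A random value for a single column type.
val randomInt = ColumnarTestUtils.makeRandomValue(INT)

// A random InternalRow matching a sequence of column types.
val randomRow = ColumnarTestUtils.makeRandomRow(INT, STRING)
```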

File Based Tests

FileBasedDataSourceTest

A helper trait that provides convenient facilities for file-based data source testing. Specifically, it is used for Parquet and Orc testing. It can be used to write tests that are shared between Parquet and Orc.

Orc Testing

OrcTest

Used for testing with data in the Orc file format.

Parquet Testing

ParquetTest

A helper trait that provides convenient facilities for Parquet testing.
NOTE: Considering that the classes Tuple1 to Tuple22 all extend Product, it is more convenient to use tuples rather than special case classes when writing test cases/suites. In particular, Tuple1.apply can be used to easily wrap a single type/value.
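
A sketch of a round-trip test with the trait's helpers (suite name hypothetical; tuples are used per the note above):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.execution.datasources.parquet.ParquetTest
import org.apache.spark.sql.test.SharedSparkSession

class MyParquetSuite extends QueryTest with ParquetTest with SharedSparkSession {
  import testImplicits._

  test("round-trip through Parquet") {
    val data = (1 to 4).map(i => (i, s"val_$i"))
    // Writes `data` to a temporary Parquet file and runs the body against it.
    withParquetDataFrame(data) { df =>
      checkAnswer(df, data.map { case (i, s) => Row(i, s) })
    }
  }
}
```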

ParquetCompatibilityTest

Helper class for testing Parquet compatibility.

DataSourceTest

Checks how Spark executes SQL queries.

SparkPlanTest

Base class for writing tests for individual physical operators. For an example of how this class’s test helper methods can be used, see SortSuite. Extends SparkFunSuite. Companion object contains helper methods for writing tests of individual physical operators.
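
A sketch modeled on SortSuite (suite name hypothetical): run a single physical operator over an input DataFrame and compare against the expected rows.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.execution.{SortExec, SparkPlan, SparkPlanTest}
import org.apache.spark.sql.test.SharedSparkSession

class MySortSuite extends SparkPlanTest with SharedSparkSession {
  import testImplicits._

  test("basic sorting on a single column") {
    val input = Seq((2, "b"), (1, "a")).toDF("i", "s")
    checkAnswer(
      input,
      // The plan function wraps the input's physical plan in the operator under test.
      (child: SparkPlan) => SortExec('i.asc :: Nil, global = true, child = child),
      expectedAnswer = Seq(Row(1, "a"), Row(2, "b")),
      sortAnswers = false)
  }
}
```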

BenchmarkQueryTest

Checks that the generated code stays within the appropriate size limit; otherwise JIT optimization might not kick in.

IntegratedUDFTestUtils

This object aims to integrate various UDF test cases so that Scala UDFs, Python UDFs and Scalar Pandas UDFs can be tested in SBT & Maven tests. The available UDFs are special: each defines a UDF wrapped by a cast. The input column is cast to string, the UDF returns the string as-is, and the output column is cast back to the input column's type. In this way, the UDF is virtually a no-op. Note that, due to this implementation limitation, complex types such as map, array and struct do not work with these UDFs, because they cannot survive the cast round-trip unchanged. To register a Scala UDF in SQL:

To register Python UDF in SQL:

To register Scalar Pandas UDF in SQL:

To use it in Scala API and SQL:
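
The registration and usage steps above can be sketched together as follows (assuming a SparkSession named `spark` is in scope, e.g. from SharedSparkSession):

```scala
import org.apache.spark.sql.IntegratedUDFTestUtils._

// Register the no-op Scala test UDF under the SQL name "udf".
val scalaTestUDF = TestScalaUDF(name = "udf")
registerTestUDF(scalaTestUDF, spark)

// The Python / Scalar Pandas variants require a worker-side Python environment:
// registerTestUDF(TestPythonUDF(name = "udf"), spark)
// registerTestUDF(TestScalarPandasUDF(name = "udf"), spark)

// Use it from SQL and from the Scala API.
spark.sql("SELECT udf(1)").show()
val df = spark.range(10)
df.select(scalaTestUDF(df("id"))).show()
```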

Streaming DataFrames and streaming Datasets Testing

StreamTest

A framework for implementing tests for streaming queries and sources. A test consists of a set of steps (expressed as a StreamAction) that are executed in order, blocking as necessary to let the stream catch up. For example, the following adds some data to a stream, blocking until it can verify that the correct values are eventually produced.

Example:
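
The canonical sketch from the StreamTest documentation (suite name hypothetical): add data to a MemoryStream, then block until the expected results are produced.

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamTest

// StreamTest already extends QueryTest with SharedSparkSession.
class MyStreamSuite extends StreamTest {
  import testImplicits._

  test("map over a stream") {
    val inputData = MemoryStream[Int]
    val mapped = inputData.toDS().map(_ + 1)

    testStream(mapped)(
      AddData(inputData, 1, 2, 3),   // push data into the source
      CheckAnswer(2, 3, 4),          // block until these values are produced
      StopStream
    )
  }
}
```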

Note that while we do sleep to allow the other thread to progress without spinning, StreamAction checks should not depend on the amount of time spent sleeping. Instead they should check the actual progress of the stream before verifying the required test condition. Currently it is assumed that all streaming queries will eventually complete in 10 seconds to avoid hanging forever in the case of failures. However, individual suites can change this by overriding streamingTimeout. Extends QueryTest with SharedSparkSession.

StateStoreMetricsTest

Extends StreamTest and, in addition, checks state store metrics.

StreamManualClock

A clock used in streaming tests that allows checking whether the stream is waiting on the clock at expected times.

1.2 Catalyst Unit Testing

PlanTest

Extends SparkFunSuite and PlanTestBase.
There is no other code; it just mixes in the two traits. When you do not need SparkFunSuite, PlanTestBase alone can be used.

PlanTestBase

Base class for plan tests. Extends PredicateHelper with SQLHelper.

Example:
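
A sketch of a Catalyst plan test (suite name hypothetical): comparePlans normalizes expression IDs before asserting that two logical plans are identical.

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

class MyPlanSuite extends PlanTest {
  test("plans match after normalization") {
    val relation = LocalRelation('a.int, 'b.int)
    val query = relation.where('a > 1).select('a)
    // In a real optimizer test the left side would be the optimized plan
    // and the right side the hand-written expected plan.
    comparePlans(query.analyze, query.analyze)
  }
}
```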

AnalysisTest

An extension of PlanTest with some useful methods. It also creates two plan analyzers, a case-sensitive and a case-insensitive one.

ExpressionEvalHelper

A few helper functions for expression evaluation testing. Mix this trait in to use them. Used mostly for Catalyst development.

1.3 Spark Hive Unit Testing

TestHiveSingleton

Base class for Spark Hive unit tests. Extends SparkFunSuite.

TestHiveContext

A locally running test instance of Spark’s Hive execution engine. Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a “clean” state. TestHive is the singleton object version of this class, because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures. Therefore, the execution of test cases that rely on TestHive must be serialized. Extends SQLContext.

HiveClientBuilder

A builder that creates a HiveClient for test purposes.

Example:
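
A sketch of building a standalone HiveClient against a given metastore version (buildClient as found in Spark's Hive test sources; the version string is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.hive.client.HiveClientBuilder

// Builds a client for Hive metastore version 2.3 with a default Hadoop configuration.
val client = HiveClientBuilder.buildClient("2.3", new Configuration())
client.runSqlHive("CREATE TABLE t (key INT)")
```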

HiveComparisonTest

Allows the creation of tests that execute the same query against both Hive and Catalyst, comparing the results. The “golden” results from Hive are cached and retrieved both from the classpath and answerCache to speed up testing. See the documentation of public vals in this class for information on how test execution can be configured using system properties. Extends SparkFunSuite.

HiveQueryFileTest

A framework for running the query tests that are listed as a set of text files. Test Suites that derive from this class must provide a map of testCaseName to testCaseFiles that should be included. Additionally, there is support for whitelisting and blacklisting tests as development progresses. Extends HiveComparisonTest.

