Apache Spark Unit Testing Part 2 — Spark SQL
Second part of the article series about how to use Spark's own test classes for unit testing. This series aims to fill the gap between code and documentation in the Spark unit testing domain. Spark ships with an extensive testing framework that allows developers to test their code in a wide variety of cases.
The Spark SQL package has four sub-projects, each of which has its own test classes:
In the context of testing your own Spark jobs we will discuss only three of them (core, catalyst, hive).
Dependencies
1.1 Spark SQL Execution Unit Testing
SharedSparkSession
Suites extending SharedSparkSession share resources (e.g. the SparkSession) across their tests. The trait initializes the Spark session in its beforeAll() implementation, before the automatic thread snapshot is performed, so the audit code could fail to report threads leaked by that shared session; the behavior is overridden here to take the snapshot before the Spark session is initialized. Extends SQLTestUtils and SharedSparkSessionBase.
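As an illustrative sketch (the suite name, table data, and test name here are hypothetical, and the Spark SQL test artifacts must be on the test classpath), a suite built on SharedSparkSession might look like:

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite: `spark` is provided by SharedSparkSession,
// initialized once in beforeAll() and shared by every test.
class EvenNumbersSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("filter keeps only even numbers") {
    val df = Seq(1, 2, 3, 4).toDF("n").filter($"n" % 2 === 0)
    checkAnswer(df, Seq(Row(2), Row(4)))
  }
}
```

The checkAnswer assertion used here comes from QueryTest, which is covered later in this article.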
SharedSparkSessionBase
Helper trait for SQL test suites where all tests share a single TestSparkSession.
TestSparkSession
A special SparkSession prepared for testing.
SQLTestUtils
Helper trait that should be extended by all SQL test suites within the Spark code base. It allows subclasses to plug in a custom SQLContext. It comes with test data prepared in advance as well as all the implicit conversions used extensively by DataFrames. To use the implicit methods, import testImplicits._ instead of importing through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SparkFunSuite, SQLTestUtilsBase and PlanTestBase.
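SQLTestUtils also inherits scoped helpers such as withTempView and withSQLConf that restore state once the enclosed block finishes. A hedged sketch (suite name, view name, and data are illustrative):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class ScopedHelpersSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("temp view is cleaned up automatically") {
    // withSQLConf restores the config, and withTempView drops the view,
    // when their blocks finish -- even if the assertions fail.
    withSQLConf("spark.sql.shuffle.partitions" -> "1") {
      withTempView("people") {
        Seq(("a", 1)).toDF("name", "age").createOrReplaceTempView("people")
        checkAnswer(sql("SELECT name FROM people"), Row("a"))
      }
      assert(!spark.catalog.tableExists("people"))
    }
  }
}
```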
SQLTestUtilsBase
Helper trait that can be extended by all external SQL test suites. It allows subclasses to plug in a custom SQLContext. To use the implicit methods, import testImplicits._ instead of importing through the SQLContext. Subclasses should not create SQLContexts in the test suite constructor, which is prone to leaving multiple overlapping SparkContexts in the same JVM. Extends SQLTestData and PlanTestBase. Contains:
A helper object for importing SQL implicits. Note that the alternative of importing spark.implicits._ is not possible here, because the SQLContext is created immediately before the first test is run, while the implicits import is needed in the constructor.
SQLTestData
A collection of sample data used in SQL tests.
QueryTest
A great framework for checking results inside the SQL package. It contains a large number of DataFrame and Dataset assertions and checks.
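For example, a hedged sketch of its checkAnswer assertion, which compares a DataFrame against expected rows independent of row ordering (suite name and data are illustrative):

```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

class SumSuite extends QueryTest with SharedSparkSession {
  test("sum over a range") {
    // 0 + 1 + 2 + 3 = 6
    val df = spark.range(4).groupBy().sum("id")
    // checkAnswer ignores row order and prints a readable diff on mismatch
    checkAnswer(df, Seq(Row(6L)))
  }
}
```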
LocalSparkSession
Manages a local spark SparkSession variable, correctly stopping it after each test.
Example:
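A minimal sketch, assuming the LocalSparkSession trait from Spark's sql test sources (the suite name is hypothetical):

```scala
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.{LocalSparkSession, SparkSession}

class StopsSessionSuite extends SparkFunSuite with LocalSparkSession {
  test("session is stopped after each test") {
    // Assign to the `spark` var provided by the trait;
    // LocalSparkSession stops it automatically after the test.
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("test")
      .getOrCreate()
    assert(spark.range(3).count() === 3L)
  }
}
```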
Metrics Testing
The base for statistics test cases that we want to include in both the hive module (for verifying behavior when using the Hive external catalog) as well as in the sql/core module.
ColumnarTestUtils
An object with useful helper methods for columnar test cases.
Example:
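A hedged sketch (makeRandomRow and the INT/STRING column types follow Spark's columnar test sources; treat the exact signatures as an assumption):

```scala
import org.apache.spark.sql.execution.columnar.{ColumnarTestUtils, INT, STRING}

// Generate a random row matching the given column types, useful when
// exercising columnar encoders and decoders.
val row = ColumnarTestUtils.makeRandomRow(INT, STRING)
```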
File Based Tests
A helper trait that provides convenient facilities for file-based data source testing. Specifically, it is used for Parquet and Orc testing. It can be used to write tests that are shared between Parquet and Orc.
Orc Testing
Used for testing with data in the Orc file format.
Parquet Testing
A helper trait that provides convenient facilities for Parquet testing.
NOTE: Considering that classes Tuple1 … Tuple22 all extend Product, it is often more convenient to use tuples rather than special case classes when writing test cases/suites. In particular, Tuple1.apply can be used to easily wrap a single type/value.
Helper class for testing Parquet compatibility.
DataSourceTest
Allows checking how Spark executes SQL queries.
SparkPlanTest
Base class for writing tests for individual physical operators. For an example of how this class's test helper methods can be used, see SortSuite. Extends SparkFunSuite. The companion object contains helper methods for writing tests of individual physical operators.
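A hedged sketch in the spirit of SortSuite (column names and data are illustrative; the Catalyst DSL import provides the 'key.asc syntax, and the exact checkAnswer signature should be treated as an assumption):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.execution.{SortExec, SparkPlan, SparkPlanTest}
import org.apache.spark.sql.test.SharedSparkSession

class TinySortSuite extends SparkPlanTest with SharedSparkSession {
  import testImplicits._

  test("SortExec sorts by key") {
    val input = Seq((3, "c"), (1, "a"), (2, "b")).toDF("key", "value")
    // checkAnswer plugs the physical operator on top of the input's plan
    // and compares the executed result with the expected rows.
    checkAnswer(
      input,
      (child: SparkPlan) => SortExec('key.asc :: Nil, global = true, child = child),
      Seq(Row(1, "a"), Row(2, "b"), Row(3, "c")),
      sortAnswers = false)
  }
}
```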
BenchmarkQueryTest
Checks that the generated code has an appropriate size; otherwise JIT optimization might not work.
IntegratedUDFTestUtils
This object aims to integrate various UDF test cases so that Scala UDF, Python UDF and Scalar Pandas UDF can be tested in SBT & Maven tests. The available UDFs are special: each defines a UDF wrapped by a cast. The input column is cast to string, the UDF returns strings as they are, and the output column is cast back to the input column's type. In this way, the UDF is virtually a no-op. Note that, due to this implementation limitation, complex types such as map, array and struct types do not work with these UDFs, because they cannot survive the cast round trip. To register a Scala UDF in SQL:
To register Python UDF in SQL:
To register Scalar Pandas UDF in SQL:
To use it in Scala API and SQL:
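A hedged sketch of registering and invoking one of these UDFs (registerTestUDF and TestScalaUDF follow IntegratedUDFTestUtils in Spark's test sources; the UDF name "udf" is an illustrative choice, and an active SparkSession named spark is assumed):

```scala
import org.apache.spark.sql.IntegratedUDFTestUtils.{registerTestUDF, TestScalaUDF}

// Register the no-op Scala test UDF under the name "udf".
registerTestUDF(TestScalaUDF(name = "udf"), spark)

// Because of the cast round trip, the UDF is virtually a no-op:
spark.sql("SELECT udf(1)").show()
```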
Streaming DataFrames and streaming Datasets Testing
StreamTest
A framework for implementing tests for streaming queries and sources. A test consists of a set of steps (expressed as a StreamAction) that are executed in order, blocking as necessary to let the stream catch up. For example, the following adds some data to a stream, blocking until it can verify that the correct values are eventually produced.
Example:
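The canonical example from the StreamTest scaladoc, embedded here in a hypothetical suite:

```scala
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.StreamTest

class SimpleStreamSuite extends StreamTest {
  import testImplicits._

  test("map adds one") {
    val inputData = MemoryStream[Int]
    val mapped = inputData.toDS().map(_ + 1)

    // Blocks as necessary until the stream produces the expected values.
    testStream(mapped)(
      AddData(inputData, 1, 2, 3),
      CheckAnswer(2, 3, 4))
  }
}
```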
Note that while we do sleep to allow the other thread to progress without spinning, StreamAction checks should not depend on the amount of time spent sleeping; instead they should check the actual progress of the stream before verifying the required test condition. Currently it is assumed that all streaming queries will eventually complete within 10 seconds to avoid hanging forever in the case of failures; however, individual suites can change this by overriding streamingTimeout. Extends QueryTest with SharedSparkSession.
StateStoreMetricsTest
Extends StreamTest; in addition it verifies state store metrics.
StreamManualClock
A manual clock used in streaming tests that allows checking whether the stream is waiting on the clock at expected times.
1.2 Catalyst Unit Testing
PlanTest
Extends SparkFunSuite and PlanTestBase.
There is no other code, just a mixin of the two traits; when you do not need SparkFunSuite, PlanTestBase alone can be used.
PlanTestBase
Base class for plan tests. Extends PredicateHelper with SQLHelper.
Example:
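A hedged sketch using comparePlans, which normalizes expression IDs before comparing logical plans (the plans are built with the Catalyst test DSL; the relation and suite name are illustrative):

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

class TinyPlanSuite extends PlanTest {
  test("structurally equal plans compare equal") {
    val relation = LocalRelation('a.int, 'b.int)
    // comparePlans normalizes expression IDs, so two independently
    // constructed but structurally identical plans are considered equal.
    comparePlans(
      relation.where('a > 1).select('b),
      relation.where('a > 1).select('b))
  }
}
```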
AnalysisTest
An extension of PlanTest with some useful methods. It also creates two plan analyzers, one case-sensitive and one case-insensitive.
ExpressionEvalHelper
A few helper functions for expression evaluation testing. Mix in this trait to use them. Used mostly for Catalyst development.
1.3 Spark Hive Unit Testing
TestHiveSingleton
Base class for Spark Hive unit tests. Extends SparkFunSuite.
TestHiveContext
A locally running test instance of Spark's Hive execution engine. Data from testTables will be automatically loaded whenever a query is run over those tables. Calling reset will delete all tables and other state in the database, leaving the database in a "clean" state. TestHive is the singleton object version of this class, because instantiating multiple copies of the Hive metastore seems to lead to weird non-deterministic failures; therefore, the execution of test cases that rely on TestHive must be serialized. Extends SQLContext.
HiveClientBuilder
A builder that makes a HiveClient for test purposes.
Example:
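A hedged sketch (buildClient follows Spark's Hive test sources; the metastore version string is illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.hive.client.HiveClientBuilder

// Build an isolated HiveClient against a specific metastore version.
val client = HiveClientBuilder.buildClient("2.3", new Configuration())
```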
HiveComparisonTest
Allows the creation of tests that execute the same query against both Hive and Catalyst, comparing the results. The "golden" results from Hive are cached and retrieved both from the classpath and answerCache to speed up testing. See the documentation of the public vals in this class for information on how test execution can be configured using system properties. Extends SparkFunSuite.
HiveQueryFileTest
A framework for running the query tests that are listed as a set of text files. Test suites that derive from this class must provide a map from testCaseName to testCaseFiles that should be included. Additionally, there is support for whitelisting and blacklisting tests as development progresses. Extends HiveComparisonTest.