
# Spark SQL Performance Tests

This is a performance testing framework for Spark SQL in Apache Spark 2.2+.

Note: This README is still under development. Please also check our source code for more information.

## Quick Start

```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text

$ bin/run --benchmark DatasetPerformance
```

## MLlib tests

To run the MLlib tests, run `bin/run-ml yamlfile`, where `yamlfile` is the path to a YAML configuration file describing the tests to run and their parameters.

## TPC-DS

### How to use it

The rest of this document uses the TPC-DS benchmark as an example. We will add content explaining how to use other benchmarks and how to add support for a new benchmark dataset in the future.

### Set up a benchmark

Before running any query, a dataset needs to be set up by creating a Benchmark object. Generating the TPC-DS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need; it can be found at https://github.com/databricks/tpcds-kit.

```scala
// If not done already, you have to set the path for the results.
spark.conf.set("spark.sql.perf.results", "/tmp/results")

import com.databricks.spark.sql.perf.tpcds.Tables
// Tables in TPC-DS benchmark used by experiments.
// dsdgenDir is the directory where the dsdgen tool is installed on your machines.
// scaleFactor defines the size of the dataset to generate (in GB).
val tables = new Tables(sqlContext, dsdgenDir, scaleFactor)

// Generate data.
// location is the place where the generated data will be written.
// format is a valid Spark data source format, e.g. "parquet".
tables.genData(location, format, overwrite, partitionTables, useDoubleForDecimal, clusterByPartitionColumns, filterOutNullPartitionValues)

// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(location, format, databaseName, overwrite)
// Or, if you want to create temporary tables instead:
tables.createTemporaryTables(location, format)

// Set up the TPC-DS experiment.
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS(sqlContext = sqlContext)
```
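
For concreteness, here is a minimal end-to-end sketch of the setup above with example values plugged in. The dsdgen path, scale factor, output location, and database name are illustrative placeholders, not required values (and, depending on your version, `scaleFactor` may need to be a string rather than an integer).

```scala
import com.databricks.spark.sql.perf.tpcds.{TPCDS, Tables}

// Illustrative placeholders -- substitute values that match your environment.
val dsdgenDir = "/opt/tpcds-kit/tools"   // where dsdgen was built, on every worker
val scaleFactor = 100                    // roughly 100 GB of generated data
val location = "hdfs:///tpcds/sf100"     // where the generated data will be written
val format = "parquet"
val databaseName = "tpcds_sf100"

// Data generation options, in the same order as the call shown above.
val overwrite = true
val partitionTables = true
val useDoubleForDecimal = false
val clusterByPartitionColumns = true
val filterOutNullPartitionValues = false

spark.conf.set("spark.sql.perf.results", "/tmp/results")

val tables = new Tables(sqlContext, dsdgenDir, scaleFactor)
tables.genData(location, format, overwrite, partitionTables, useDoubleForDecimal,
  clusterByPartitionColumns, filterOutNullPartitionValues)
tables.createExternalTables(location, format, databaseName, overwrite)

val tpcds = new TPCDS(sqlContext = sqlContext)
```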

### Run benchmarking queries

After setup, users can use the runExperiment function to run benchmarking queries and record query execution times. Taking TPC-DS as an example, you can start an experiment with:

```scala
val experiment = tpcds.runExperiment(tpcds.interactiveQueries)
experiment.waitForFinish(60*60*10) // optional: wait for results (with timeout)
```
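
If you only want to benchmark a subset of the queries, one option is to filter the query sequence before passing it to runExperiment. The following is a sketch that assumes each query object exposes a `name` field (the same name that later appears in the results); the specific query names are illustrative.

```scala
// Run only a few queries from the interactive set; the names below are
// illustrative -- adjust them to the names your query set actually uses.
val wanted = Set("q19", "q42", "q52")
val subset = tpcds.interactiveQueries.filter(q => wanted.exists(prefix => q.name.startsWith(prefix)))

val smallExperiment = tpcds.runExperiment(subset)
smallExperiment.waitForFinish(60 * 60) // wait up to one hour for this smaller run
```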

For every experiment run (i.e. every call of runExperiment), Spark SQL Perf uses the timestamp of the start time to identify the experiment. Performance results are stored in a sub-directory named by that timestamp under the configured spark.sql.perf.results path (for example, /tmp/results/timestamp=1429213883272). The performance results are stored in JSON format.

### Retrieve results

While the experiment is running, you can use `experiment.html` to check its status. Once the experiment is complete, you can load the results from disk.

```scala
// Get the results of all experiments.
val resultTable = spark.read.json(spark.conf.get("spark.sql.perf.results"))
resultTable.createOrReplaceTempView("sqlPerformance")
sqlContext.table("sqlPerformance")
// Get the result of a particular run by specifying the timestamp of that run.
sqlContext.table("sqlPerformance").filter("timestamp = 1429132621024")
// or
val specificResultTable = spark.read.json(experiment.resultPath)
```
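
To get a quick per-query summary from the loaded results, you can explode the per-query records and aggregate their execution times. The sketch below assumes the layout spark-sql-perf uses for its result JSON, where each row carries a `results` array whose elements include `name` and `executionTime` (in milliseconds); adjust the field names if your results differ.

```scala
import org.apache.spark.sql.functions._

// Per-query min/max/avg execution time (ms) for the run loaded above.
// Assumes each row has a `results` array of per-query records with
// `name` and `executionTime` fields.
specificResultTable
  .withColumn("result", explode(col("results")))
  .select(col("result.name").as("query"), col("result.executionTime").as("timeMs"))
  .groupBy("query")
  .agg(
    min("timeMs").as("minTimeMs"),
    max("timeMs").as("maxTimeMs"),
    avg("timeMs").as("avgTimeMs"))
  .orderBy("query")
  .show(truncate = false)
```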