
Timothy Hunter 1388722b81 Initial commit for adding MLlib reporting in spark-sql-perf		2016-06-22 16:59:49 -07:00

This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines. There are 2 ways to describe and run ML pipelines:
- programmatically, in Scala (see MLBenchmarks.scala)
- using a simple YAML file (see mllib-small.yaml for an example)

The YAML approach is preferred because it programmatically generates the cartesian product of all the experiments to run and validates the types of the objects in the YAML file. In both cases, all the ML experiments are standard benchmarks.

This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.

It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:

```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```

and then:

```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #69 from thunterdb/1605-mllib2.
bin	Improvements to running the benchmark	2016-01-24 20:24:54 -08:00
build	Update SBT	2015-08-20 16:46:45 -07:00
dev	Add Merge Script	2015-09-09 20:03:52 -07:00
project	Include publishing to BinTray in release process	2015-12-23 00:09:35 -08:00
src	Initial commit for adding MLlib reporting in spark-sql-perf	2016-06-22 16:59:49 -07:00
.gitignore	Fix build and switch to jdk8	2016-05-23 12:54:07 -07:00
.travis.yml	Fix build and switch to jdk8	2016-05-23 12:54:07 -07:00
build.sbt	Initial commit for adding MLlib reporting in spark-sql-perf	2016-06-22 16:59:49 -07:00
LICENSE	Initial port.	2015-04-15 20:03:14 -07:00
README.md	Main Class for running Benchmarks from the command line	2016-01-19 12:37:51 -08:00
version.sbt	fix checking results and bump to 0.4.9	2016-06-17 12:53:12 -07:00

README.md

Spark SQL Performance Tests

This is a performance testing framework for Spark SQL in Apache Spark 1.6+.

Note: This README is still under development. Please also check our source code for more information.

Quick Start

$ bin/run --help

spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text
        
$ bin/run --benchmark DatasetPerformance

TPC-DS

How to use it

The rest of this document uses the TPC-DS benchmark as an example. In the future, we will add content explaining how to use other benchmarks and how to add support for a new benchmark dataset.

Set up a benchmark

Before running any query, a dataset needs to be set up by creating a Benchmark object. Generating the TPC-DS data requires dsdgen to be built and available on the machines. We have a fork of dsdgen that you will need. It can be found here.

import com.databricks.spark.sql.perf.tpcds.Tables
// Tables in the TPC-DS benchmark used by experiments.
// dsdgenDir is the location of the dsdgen tool installed on your machines.
val tables = new Tables(sqlContext, dsdgenDir, scaleFactor)
// Generate data.
tables.genData(location, format, overwrite, partitionTables, useDoubleForDecimal, clusterByPartitionColumns, filterOutNullPartitionValues)
// Create metastore tables in a specified database for your data.
// Once tables are created, the current database will be switched to the specified database.
tables.createExternalTables(location, format, databaseName, overwrite)
// Or, if you want to create temporary tables
tables.createTemporaryTables(location, format)
// Set up the TPC-DS experiment.
import com.databricks.spark.sql.perf.tpcds.TPCDS
val tpcds = new TPCDS(sqlContext = sqlContext)
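
The snippet above leaves its arguments undefined. As a minimal sketch, these are values one might define in spark-shell before making those calls. Every path, name, and value here is a hypothetical placeholder, and the types and the meaning attached to each flag are assumptions inferred from the parameter names, so check them against the source of the version you are using.

```scala
// Illustrative values only; none of these are defaults of spark-sql-perf.
val dsdgenDir = "/opt/tpcds-kit/tools"   // hypothetical directory containing the dsdgen binary
val scaleFactor = 1                      // TPC-DS scale factor (roughly the dataset size in GB); type assumed
val location = "/tmp/tpcds/sf1"          // hypothetical path where the generated data is written
val format = "parquet"                   // assumed to be a Spark SQL data source name
val databaseName = "tpcds_sf1"           // hypothetical metastore database name

// Boolean switches, named after the genData parameters shown above.
val overwrite = true
val partitionTables = false
val useDoubleForDecimal = false
val clusterByPartitionColumns = false
val filterOutNullPartitionValues = false
```

Starting with a small scale factor keeps data generation short while checking that dsdgen is reachable on every machine.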

Run benchmarking queries

After setup, you can use the runExperiment function to run benchmarking queries and record query execution times. Taking TPC-DS as an example, you can start an experiment with:

val experiment = tpcds.runExperiment(tpcds.interactiveQueries)

For every experiment run (i.e. every call of runExperiment), Spark SQL Perf uses the timestamp of the start time to identify the experiment. Performance results are stored in JSON format, in a subdirectory of the given resultsLocation named after that timestamp (for example results/1429213883272).
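
The naming scheme described above can be sketched in plain Scala. This is an illustration, not code from Spark SQL Perf: the object and helper names are hypothetical, and treating the timestamp as milliseconds since the Unix epoch is an assumption based on its 13-digit form.

```scala
import java.time.Instant

object ResultsPathSketch {
  // Hypothetical helper: joins resultsLocation and the run timestamp
  // into the subdirectory name described in the text.
  def resultsPath(resultsLocation: String, timestamp: Long): String =
    s"${resultsLocation.stripSuffix("/")}/$timestamp"

  // Hypothetical helper: interprets the run identifier as epoch milliseconds
  // (an assumption) so a run can be matched to a wall-clock time.
  def runStartTime(timestamp: Long): Instant =
    Instant.ofEpochMilli(timestamp)
}
```

For instance, `ResultsPathSketch.resultsPath("results", 1429213883272L)` yields the path used as the example in the text, and `runStartTime` helps locate a run when only its approximate start time is known.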

Retrieve results

While the experiment is running, you can use experiment.html to list its status. Once the experiment is complete, the results are saved in JSON and exposed through the table sqlPerformance.

// Get the results of all experiments.
tpcds.createResultsTable()
sqlContext.table("sqlPerformance")
// Get the result of a particular run by specifying the timestamp of that run.
sqlContext.table("sqlPerformance").filter("timestamp = 1429132621024")