more cleanup, update readme
parent 51b9dcb5b5
commit a239da90a2
33
README.md
@@ -1,55 +1,34 @@
# Spark SQL Performance Tests

This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.3+.
This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.4+.

**Note: This README is still under development. Please also check our source code for more information.**

## How to use it
The rest of document will use TPC-DS benchmark as an example. We will add contents to explain how to use other benchmarks add the support of a new benchmark dataset in future.

### Setup a dataset
Before running any query, a dataset needs to be setup by creating a `Dataset` object. Every benchmark support in Spark SQL Perf needs to implement its own `Dataset` class. A `Dataset` object takes a few parameters that will be used to setup the needed tables and its `setup` function is used to setup needed tables. For TPC-DS benchmark, the class is `TPCDS` in the package of `com.databricks.spark.sql.perf.tpcds`. For example, to setup a TPC-DS dataset, you can
### Setup a benchmark
Before running any query, a dataset needs to be setup by creating a `Benchmark` object.

```
import org.apache.spark.sql.parquet.Tables
// Tables in TPC-DS benchmark used by experiments.
val tables = Tables(sqlContext)
// Setup TPC-DS experiment
val tpcds =
new TPCDS (
sqlContext = sqlContext,
sparkVersion = "1.3.1",
dataLocation = <the location of data>,
dsdgenDir = <the location of dsdgen in every worker>,
tables = tables.tables,
scaleFactor = <scale factor>)
val tpcds = new TPCDS (sqlContext = sqlContext)
```

After a `TPCDS` object is created, tables of it can be setup by calling

```
tpcds.setup()
```

The `setup` function will first check if needed tables are stored at the location specified by `dataLocation`. If not, it will creates tables at there by using the data generator tool `dsdgen` provided by TPC-DS benchmark (This tool needs to be pre-installed at the location specified by `dsdgenDir` in every worker).

### Run benchmarking queries
After setup, users can use `runExperiment` function to run benchmarking queries and record query execution time. Taking TPC-DS as an example, you can start an experiment by using

```
tpcds.runExperiment(
queries = <a Seq of Queries>,
resultsLocation = <the root location of performance results>,
includeBreakdown = <if measure the performance of every physical operators>,
iterations = <the number of iterations>,
variations = <variations used in the experiment>,
tags = <tags of this experiment>)
val experiment = tpcds.runExperiment(queriesToRun = tpcds.interactiveQueries)
```

For every experiment run (i.e.\ every call of `runExperiment`), Spark SQL Perf will use the timestamp of the start time to identify this experiment. Performance results will be stored in the sub-dir named by the timestamp in the given `resultsLocation` (for example `results/1429213883272`). The performance results are stored in the JSON format.

### Retrieve results
The follow code can be used to retrieve results ...
While the experiment is running you can use `experiment.html` to list the status. Once the experiment is complete, the results will be saved to the table sqlPerformance in json.

```
// Get experiments results.

@@ -377,7 +377,7 @@ abstract class Benchmark(@transient protected val sqlContext: SQLContext)
|${buildDataFrame.queryExecution.analyzed}
""".stripMargin

val tablesInvolved = buildDataFrame.queryExecution.logical collect {
lazy val tablesInvolved = buildDataFrame.queryExecution.logical collect {
case UnresolvedRelation(tableIdentifier, _) => {
// We are ignoring the database name.
tableIdentifier.last

@@ -14,7 +14,7 @@
* limitations under the License.
*/

package com.databricks.spark.sql.perf.tpcds.queries
package com.databricks.spark.sql.perf.tpcds

import com.databricks.spark.sql.perf.Benchmark

@@ -14,7 +14,7 @@
* limitations under the License.
*/

package com.databricks.spark.sql.perf.tpcds.queries
package com.databricks.spark.sql.perf.tpcds

import com.databricks.spark.sql.perf.Benchmark

@@ -22,27 +22,9 @@ import org.apache.spark.sql.SQLContext
/**
* TPC-DS benchmark's dataset.
* @param sqlContext An existing SQLContext.
* @param sparkVersion The version of Spark.
* @param dataLocation The location of the dataset used by this experiment.
* @param dsdgenDir The location of dsdgen in every worker machine.
* @param scaleFactor The scale factor of the dataset. For some benchmarks like TPC-H
* and TPC-DS, the scale factor is a number roughly representing the
* size of raw data files. For some other benchmarks, the scale factor
* is a short string describing the scale of the dataset.
*/
class TPCDS (
@transient sqlContext: SQLContext,
sparkVersion: String,
dataLocation: String,
dsdgenDir: String,
scaleFactor: String,
userSpecifiedBaseDir: Option[String] = None)
extends Benchmark(sqlContext) with Serializable {
import sqlContext._
import sqlContext.implicits._

lazy val baseDir =
userSpecifiedBaseDir.getOrElse(s"$dataLocation/scaleFactor=$scaleFactor/useDecimal=true")
class TPCDS (@transient sqlContext: SQLContext)
extends Benchmark(sqlContext) with ImpalaKitQueries with SimpleQueries with Serializable {

/*
def setupBroadcast(skipTables: Seq[String] = Seq("store_sales", "customer")) = {

@@ -1,17 +0,0 @@
/*
* Copyright 2015 Databricks Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.databricks.spark.sql.perf.tpcds