diff --git a/README.md b/README.md
index c188a55..94308f2 100644
--- a/README.md
+++ b/README.md
@@ -1,55 +1,34 @@
 # Spark SQL Performance Tests
 
-This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.3+.
+This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.4+.
 
 **Note: This README is still under development. Please also check our source code for more information.**
 
 ## How to use it
 
 The rest of this document uses the TPC-DS benchmark as an example. We will add content explaining how to use other benchmarks and how to add support for new benchmark datasets in the future.
 
-### Setup a dataset
-Before running any query, a dataset needs to be setup by creating a `Dataset` object. Every benchmark support in Spark SQL Perf needs to implement its own `Dataset` class. A `Dataset` object takes a few parameters that will be used to setup the needed tables and its `setup` function is used to setup needed tables. For TPC-DS benchmark, the class is `TPCDS` in the package of `com.databricks.spark.sql.perf.tpcds`. For example, to setup a TPC-DS dataset, you can
+### Set up a benchmark
+Before running any query, a dataset needs to be set up by creating a `Benchmark` object.
 
 ```
 import org.apache.spark.sql.parquet.Tables
 // Tables in TPC-DS benchmark used by experiments.
 val tables = Tables(sqlContext)
 // Setup TPC-DS experiment
-val tpcds =
-  new TPCDS (
-    sqlContext = sqlContext,
-    sparkVersion = "1.3.1",
-    dataLocation = ,
-    dsdgenDir = ,
-    tables = tables.tables,
-    scaleFactor = )
+val tpcds = new TPCDS (sqlContext = sqlContext)
 ```
 
-After a `TPCDS` object is created, tables of it can be setup by calling
-
-```
-tpcds.setup()
-```
-
-The `setup` function will first check if needed tables are stored at the location specified by `dataLocation`. If not, it will creates tables at there by using the data generator tool `dsdgen` provided by TPC-DS benchmark (This tool needs to be pre-installed at the location specified by `dsdgenDir` in every worker).
-
 ### Run benchmarking queries
 After setup, users can use the `runExperiment` function to run benchmarking queries and record query execution times. Taking TPC-DS as an example, you can start an experiment with
 
 ```
-tpcds.runExperiment(
-  queries = ,
-  resultsLocation = ,
-  includeBreakdown = ,
-  iterations = ,
-  variations = ,
-  tags = )
+val experiment = tpcds.runExperiment(queriesToRun = tpcds.interactiveQueries)
 ```
 
 For every experiment run (i.e., every call of `runExperiment`), Spark SQL Perf uses the timestamp of the start time to identify the experiment. Performance results are stored in a sub-directory named by that timestamp under the given `resultsLocation` (for example, `results/1429213883272`). The performance results are stored in JSON format.
 
 ### Retrieve results
-The follow code can be used to retrieve results ...
+While the experiment is running, you can use `experiment.html` to check its status. Once the experiment is complete, the results will be saved to the table `sqlPerformance` in JSON format.
 
 ```
 // Get experiments results.
diff --git a/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala b/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala
index 1bc0955..5b1119c 100644
--- a/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala
@@ -377,7 +377,7 @@ abstract class Benchmark(@transient protected val sqlContext: SQLContext)
        |${buildDataFrame.queryExecution.analyzed}
      """.stripMargin
 
-    val tablesInvolved = buildDataFrame.queryExecution.logical collect {
+    lazy val tablesInvolved = buildDataFrame.queryExecution.logical collect {
      case UnresolvedRelation(tableIdentifier, _) => {
        // We are ignoring the database name.
        tableIdentifier.last
diff --git a/src/main/scala/com/databricks/spark/sql/perf/query.scala b/src/main/scala/com/databricks/spark/sql/perf/query.scala
deleted file mode 100644
index e69de29..0000000
diff --git a/src/main/scala/com/databricks/spark/sql/perf/runBenchmarks.scala b/src/main/scala/com/databricks/spark/sql/perf/runBenchmarks.scala
deleted file mode 100644
index e69de29..0000000
diff --git a/src/main/scala/com/databricks/spark/sql/perf/table.scala b/src/main/scala/com/databricks/spark/sql/perf/table.scala
deleted file mode 100644
index e69de29..0000000
diff --git a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala b/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala
similarity index 99%
rename from src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala
rename to src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala
index 063ea55..4f4c3e8 100644
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala
@@ -14,7 +14,7 @@
  * limitations under the License.
  */
 
-package com.databricks.spark.sql.perf.tpcds.queries
+package com.databricks.spark.sql.perf.tpcds
 
 import com.databricks.spark.sql.perf.Benchmark
diff --git a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala b/src/main/scala/com/databricks/spark/sql/perf/tpcds/SimpleQueries.scala
similarity index 99%
rename from src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala
rename to src/main/scala/com/databricks/spark/sql/perf/tpcds/SimpleQueries.scala
index cb26820..383c68f 100644
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/SimpleQueries.scala
@@ -14,7 +14,7 @@
  * limitations under the License.
  */
 
-package com.databricks.spark.sql.perf.tpcds.queries
+package com.databricks.spark.sql.perf.tpcds
 
 import com.databricks.spark.sql.perf.Benchmark
diff --git a/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala b/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
index 2c618ac..dab3fbd 100644
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
@@ -22,27 +22,9 @@ import org.apache.spark.sql.SQLContext
 
 /**
  * TPC-DS benchmark's dataset.
  * @param sqlContext An existing SQLContext.
- * @param sparkVersion The version of Spark.
- * @param dataLocation The location of the dataset used by this experiment.
- * @param dsdgenDir The location of dsdgen in every worker machine.
- * @param scaleFactor The scale factor of the dataset. For some benchmarks like TPC-H
- *                    and TPC-DS, the scale factor is a number roughly representing the
- *                    size of raw data files. For some other benchmarks, the scale factor
- *                    is a short string describing the scale of the dataset.
  */
-class TPCDS (
-    @transient sqlContext: SQLContext,
-    sparkVersion: String,
-    dataLocation: String,
-    dsdgenDir: String,
-    scaleFactor: String,
-    userSpecifiedBaseDir: Option[String] = None)
-  extends Benchmark(sqlContext) with Serializable {
-  import sqlContext._
-  import sqlContext.implicits._
-
-  lazy val baseDir =
-    userSpecifiedBaseDir.getOrElse(s"$dataLocation/scaleFactor=$scaleFactor/useDecimal=true")
+class TPCDS (@transient sqlContext: SQLContext)
+  extends Benchmark(sqlContext) with ImpalaKitQueries with SimpleQueries with Serializable {
 
 /*
   def setupBroadcast(skipTables: Seq[String] = Seq("store_sales", "customer")) = {
diff --git a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/package.scala b/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/package.scala
deleted file mode 100644
index 65f72fe..0000000
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/package.scala
+++ /dev/null
@@ -1,17 +0,0 @@
-/*
- * Copyright 2015 Databricks Inc.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package com.databricks.spark.sql.perf.tpcds
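Taken together, the README changes in this patch describe a much smaller surface: build `Tables`, construct `TPCDS` from only a `SQLContext`, and call `runExperiment`. The sketch below strings those steps into one driver program. Only `Tables`, `TPCDS`, `runExperiment`, `queriesToRun`, `interactiveQueries`, and `experiment.html` come from the diff itself; the `SparkConf`/`SparkContext` bootstrapping and the `RunTpcds` object name are illustrative assumptions about the caller's environment, not part of the patch.

```scala
// Illustrative driver for the simplified API introduced by this patch.
// Assumes a Spark 1.4+ deployment with spark-sql-perf on the classpath.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.parquet.Tables
import com.databricks.spark.sql.perf.tpcds.TPCDS

object RunTpcds {
  def main(args: Array[String]): Unit = {
    // Hypothetical bootstrap; in a spark-shell, sc and sqlContext already exist.
    val sc = new SparkContext(new SparkConf().setAppName("spark-sql-perf"))
    val sqlContext = new SQLContext(sc)

    // Tables in the TPC-DS benchmark used by experiments (per the README).
    val tables = Tables(sqlContext)

    // The TPCDS benchmark now needs only the SQLContext; query sets come
    // from the mixed-in ImpalaKitQueries and SimpleQueries traits.
    val tpcds = new TPCDS(sqlContext = sqlContext)

    // Results land under resultsLocation/<start-timestamp> in JSON format,
    // and are later queryable from the sqlPerformance table.
    val experiment = tpcds.runExperiment(queriesToRun = tpcds.interactiveQueries)

    // Poll the experiment's status while queries run.
    println(experiment.html)
  }
}
```

Since this code requires a running Spark cluster (and, for TPC-DS data, pre-generated tables), treat it as a shape for your own driver rather than something runnable as-is.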