more cleanup, update readme

2015-08-11 15:51:34 -07:00 · 2015-08-11 15:51:34 -07:00 · a239da90a2
commit a239da90a2
parent 51b9dcb5b5
9 changed files with 11 additions and 67 deletions
--- a/README.md
+++ b/README.md
@@ -1,55 +1,34 @@
 # Spark SQL Performance Tests

-This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.3+.
+This is a performance testing framework for [Spark SQL](https://spark.apache.org/sql/) in [Apache Spark](https://spark.apache.org/) 1.4+.

 **Note: This README is still under development. Please also check our source code for more information.**

 ## How to use it
 The rest of document will use TPC-DS benchmark as an example. We will add contents to explain how to use other benchmarks add the support of a new benchmark dataset in future.

-### Setup a dataset
-Before running any query, a dataset needs to be setup by creating a `Dataset` object. Every benchmark support in Spark SQL Perf needs to implement its own `Dataset` class. A `Dataset` object takes a few parameters that will be used to setup the needed tables and its `setup` function is used to setup needed tables. For TPC-DS benchmark, the class is `TPCDS` in the package of `com.databricks.spark.sql.perf.tpcds`. For example, to setup a TPC-DS dataset, you can 
+### Setup a benchmark
+Before running any query, a dataset needs to be setup by creating a `Benchmark` object.   

 ```
 import org.apache.spark.sql.parquet.Tables
 // Tables in TPC-DS benchmark used by experiments.
 val tables = Tables(sqlContext)
 // Setup TPC-DS experiment
-val tpcds =
-  new TPCDS (
-    sqlContext = sqlContext,
-    sparkVersion = "1.3.1",
-    dataLocation = <the location of data>,
-    dsdgenDir = <the location of dsdgen in every worker>,
-    tables = tables.tables,
-    scaleFactor = <scale factor>)
+val tpcds = new TPCDS (sqlContext = sqlContext)
 ```

-After a `TPCDS` object is created, tables of it can be setup by calling
-
-```
-tpcds.setup()
-```
-
-The `setup` function will first check if needed tables are stored at the location specified by `dataLocation`. If not, it will creates tables at there by using the data generator tool `dsdgen` provided by TPC-DS benchmark (This tool needs to be pre-installed at the location specified by `dsdgenDir` in every worker).
-
 ### Run benchmarking queries
 After setup, users can use `runExperiment` function to run benchmarking queries and record query execution time. Taking TPC-DS as an example, you can start an experiment by using

 ```
-tpcds.runExperiment(
-  queries = <a Seq of Queries>,
-  resultsLocation = <the root location of performance results>,
-  includeBreakdown = <if measure the performance of every physical operators>,
-  iterations = <the number of iterations>,
-  variations = <variations used in the experiment>,
-  tags = <tags of this experiment>)
+val experiment = tpcds.runExperiment(queriesToRun = tpcds.interactiveQueries)
 ```

 For every experiment run (i.e.\ every call of `runExperiment`), Spark SQL Perf will use the timestamp of the start time to identify this experiment. Performance results will be stored in the sub-dir named by the timestamp in the given `resultsLocation` (for example `results/1429213883272`). The performance results are stored in the JSON format.

 ### Retrieve results
-The follow code can be used to retrieve results ...
+While the experiment is running you can use `experiment.html` to list the status.  Once the experiment is complete, the results will be saved to the table sqlPerformance in json.

 ```
 // Get experiments results.
--- a/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/Benchmark.scala
@@ -377,7 +377,7 @@ abstract class Benchmark(@transient protected val sqlContext: SQLContext)
         |${buildDataFrame.queryExecution.analyzed}
       """.stripMargin

-    val tablesInvolved = buildDataFrame.queryExecution.logical collect {
+    lazy val tablesInvolved = buildDataFrame.queryExecution.logical collect {
      case UnresolvedRelation(tableIdentifier, _) => {
        // We are ignoring the database name.
        tableIdentifier.last
--- a/src/main/scala/com/databricks/spark/sql/perf/query.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/query.scala
--- a/src/main/scala/com/databricks/spark/sql/perf/runBenchmarks.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/runBenchmarks.scala
--- a/src/main/scala/com/databricks/spark/sql/perf/table.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/table.scala
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala
@@ -14,7 +14,7 @@
 * limitations under the License.
 */

-package com.databricks.spark.sql.perf.tpcds.queries
+package com.databricks.spark.sql.perf.tpcds

 import com.databricks.spark.sql.perf.Benchmark

--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala
@@ -14,7 +14,7 @@
 * limitations under the License.
 */

-package com.databricks.spark.sql.perf.tpcds.queries
+package com.databricks.spark.sql.perf.tpcds

 import com.databricks.spark.sql.perf.Benchmark

--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS.scala
@@ -22,27 +22,9 @@ import org.apache.spark.sql.SQLContext
 /**
 * TPC-DS benchmark's dataset.
 * @param sqlContext An existing SQLContext.
- * @param sparkVersion The version of Spark.
- * @param dataLocation The location of the dataset used by this experiment.
- * @param dsdgenDir The location of dsdgen in every worker machine.
- * @param scaleFactor The scale factor of the dataset. For some benchmarks like TPC-H
- *                    and TPC-DS, the scale factor is a number roughly representing the
- *                    size of raw data files. For some other benchmarks, the scale factor
- *                    is a short string describing the scale of the dataset.
 */
-class TPCDS (
-    @transient sqlContext: SQLContext,
-    sparkVersion: String,
-    dataLocation: String,
-    dsdgenDir: String,
-    scaleFactor: String,
-    userSpecifiedBaseDir: Option[String] = None)
-  extends Benchmark(sqlContext) with Serializable {
-  import sqlContext._
-  import sqlContext.implicits._
-
-  lazy val baseDir =
-    userSpecifiedBaseDir.getOrElse(s"$dataLocation/scaleFactor=$scaleFactor/useDecimal=true")
+class TPCDS (@transient sqlContext: SQLContext)
+  extends Benchmark(sqlContext) with ImpalaKitQueries with SimpleQueries with Serializable {

  /*
  def setupBroadcast(skipTables: Seq[String] = Seq("store_sales", "customer")) = {
--- a/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/package.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/package.scala
@@ -1,17 +0,0 @@
-/*
- * Copyright 2015 Databricks Inc.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- *    http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package com.databricks.spark.sql.perf.tpcds
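
Note on the `Benchmark.scala` hunk: the change from `val tablesInvolved` to `lazy val tablesInvolved` defers the `collect` over the logical plan until the value is first read, and caches the result afterwards. The surrounding code is not visible in this diff, so the following is only a minimal sketch of that language behavior, not the project's code. `LazyValSketch`, `collectTables`, `EagerQuery`, and `LazyQuery` are hypothetical names, and the counter stands in for the cost of walking a query plan.

```scala
// Hypothetical illustration of `val` vs `lazy val`; none of these names
// exist in spark-sql-perf.
object LazyValSketch {
  // Counts how often the "expensive" computation actually runs.
  var evaluations = 0

  // Stand-in for traversing a logical plan to collect table names.
  def collectTables(): Seq[String] = {
    evaluations += 1
    Seq("store_sales", "customer")
  }

  class EagerQuery {
    // Evaluated once, at construction time, even if never read.
    val tablesInvolved: Seq[String] = collectTables()
  }

  class LazyQuery {
    // Evaluated on first access only, then cached.
    lazy val tablesInvolved: Seq[String] = collectTables()
  }

  def main(args: Array[String]): Unit = {
    new EagerQuery()
    println(s"after eager construction: $evaluations") // 1

    val q = new LazyQuery()
    println(s"after lazy construction: $evaluations")  // still 1

    q.tablesInvolved
    q.tablesInvolved
    println(s"after two lazy reads: $evaluations")     // 2
  }
}
```

If the value is only needed on some code paths (for example, when a description string is actually rendered), the lazy form avoids the traversal on the other paths. The trade-off is a small synchronization cost on first access.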