How to use it:
```
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /root/tmp/tpcds-kit/tools -s 5 -l /root/tmp/tpcds5g -f parquet"
```
```
[root@spark-3267648 spark-sql-perf]# build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help"
[info] Running com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help
[info] Usage: Gen-TPC-DS-data [options]
[info]
[info] -m, --master <value> the Spark master to use, default to local[*]
[info] -d, --dsdgenDir <value> location of dsdgen
[info] -s, --scaleFactor <value>
[info] scaleFactor defines the size of the dataset to generate (in GB)
[info] -l, --location <value> root directory of location to create data in
[info] -f, --format <value> valid spark format, Parquet, ORC ...
[info] -i, --useDoubleForDecimal <value>
[info] true to replace DecimalType with DoubleType
[info] -e, --useStringForDate <value>
[info] true to replace DateType with StringType
[info] -o, --overwrite <value> overwrite the data that is already there
[info] -p, --partitionTables <value>
[info] create the partitioned fact tables
[info] -c, --clusterByPartitionColumns <value>
[info] shuffle to get partitions coalesced into single files
[info] -v, --filterOutNullPartitionValues <value>
[info] true to filter out the partition with NULL key value
[info] -t, --tableFilter <value>
[info] "" means generate all tables
[info] -n, --numPartitions <value>
[info] how many dsdgen partitions to run - number of input tasks.
[info] --help prints this usage text
```
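The same generation can also be driven programmatically from `spark-shell` or a notebook. A sketch mirroring the command above, using this repo's `TPCDSTables` class (argument names follow the README of this project but may vary between versions, so treat this as illustrative rather than canonical):

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

// Same settings as the CLI example: dsdgen from tpcds-kit, 5 GB scale factor.
val tables = new TPCDSTables(sqlContext,
  dsdgenDir = "/root/tmp/tpcds-kit/tools",  // location of the dsdgen binary
  scaleFactor = "5",                        // dataset size in GB
  useDoubleForDecimal = false,              // keep DecimalType columns
  useStringForDate = false)                 // keep DateType columns

tables.genData(
  location = "/root/tmp/tpcds5g",           // root directory for the output
  format = "parquet",                       // any valid Spark data source format
  overwrite = true,                         // replace data that is already there
  partitionTables = true,                   // create partitioned fact tables
  clusterByPartitionColumns = true,         // shuffle to coalesce partitions into single files
  filterOutNullPartitionValues = false,     // keep partitions with NULL key values
  tableFilter = "",                         // "" means generate all tables
  numPartitions = 100)                      // number of dsdgen input tasks
```

Each `genData` parameter corresponds one-to-one with a CLI flag in the usage text above.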
* Refactor deprecated `getOrCreate()` for Spark 3
* Compile with Scala 2.12
* Update usage text related to obsolete/deprecated features
* Remove scala-logging; use slf4j directly
* Add basic partitioning to TPC-H tables, following the VectorH paper as a baseline
* Multi data generation (TPC-H and TPC-DS) and multi scale factor notebook/script:
  generates all the selected scale factors and benchmarks in one run
* TPC-H runner notebook or script for spark-shell
* Add basic TPC-H documentation
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() to BenchmarkAlgorithm.getPipelineStage() so that both Estimators and Transformers can be benchmarked, not just Estimators
Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2
* Updates for running ML tests from the command line + added Naive Bayes test
* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in
BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.
* Add comment for main method in MLlib.scala
* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param
* Add UnaryTransformer trait for common data/methods to be shared across all objects
testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)
* Respond to review comments:
  * bin/run-ml: add newline at EOF
  * Query.scala: organize imports
  * MLlib.scala: organize imports, fix SparkContext initialization
  * NaiveBayes.scala: remove unused temp val, improve probability calculation in trueModel()
  * Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver
* Fix bug in Bucketizer.scala
* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml
* Update Query.scala to use p() to access SparkPlans under a given SparkPlan
* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
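The getEstimator() → getPipelineStage() change above works because in Spark ML both `Estimator` and `Transformer` extend the common parent `PipelineStage`. A minimal sketch of the idea (the trait and object names here are illustrative, not the repo's exact definitions):

```scala
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.Bucketizer

// Before: a benchmark had to return an Estimator, so pure
// Transformers (featurizers like Bucketizer) could not be benchmarked:
//   def getEstimator(): Estimator[_]

// After: returning the common supertype PipelineStage admits both.
trait BenchmarkAlgorithmSketch {
  def getPipelineStage(): PipelineStage
}

object BucketizerBenchmarkSketch extends BenchmarkAlgorithmSketch {
  // Bucketizer is a Transformer, yet it satisfies the new signature.
  override def getPipelineStage(): PipelineStage =
    new Bucketizer().setSplits(
      Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
}
```

The same widening is what lets the Bucketizer and other featurizer tests (StringIndexer, OneHotEncoder, HashingTF, etc.) share the benchmarking harness with model-fitting tests like NaiveBayes.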
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:
```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]
-b <value> | --benchmark <value>
the name of the benchmark to run
-f <value> | --filter <value>
a filter on the name of the queries to run
-i <value> | --iterations <value>
the number of iterations to run
--help
prints this usage text
$ bin/run --benchmark DatasetPerformance
```
Author: Michael Armbrust <michael@databricks.com>
Closes #47 from marmbrus/MainClass.