This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.
There are two ways to describe and run ML pipelines:
- programmatically, in Scala (see MLBenchmarks.scala)
- using a simple YAML file (see mllib-small.yaml for an example)
The YAML approach is preferred because it programmatically generates the cartesian product of all the experiments to run and validates the types of the objects in the YAML file.
In both cases, all the ML experiments are standard benchmarks.
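To illustrate the cartesian-product expansion described above, here is a minimal, self-contained Scala sketch. The names (`GridExpansion`, `cartesian`, the parameter keys) are illustrative assumptions, not the library's actual API: each YAML key maps to a list of candidate values, and the runner expands them into one experiment per combination.

```scala
// Hypothetical sketch of expanding a parameter grid into the cartesian
// product of experiments (names are illustrative, not the library's API).
object GridExpansion {
  // Fold over the grid: for each key, extend every partial experiment
  // with each candidate value for that key.
  def cartesian(params: Map[String, Seq[Any]]): Seq[Map[String, Any]] =
    params.foldLeft(Seq(Map.empty[String, Any])) {
      case (acc, (key, values)) =>
        for (m <- acc; v <- values) yield m + (key -> v)
    }

  def main(args: Array[String]): Unit = {
    val grid = Map(
      "numExamples" -> Seq(1000, 10000),
      "numFeatures" -> Seq(10, 100))
    // 2 x 2 = 4 experiment configurations
    cartesian(grid).foreach(println)
  }
}
```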
This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.
It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:
```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:
```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #69 from thunterdb/1605-mllib2.
This patch extracts `Query` into its own top-level class and makes its `sparkContext` field transient in order to fix `NotSerializableException`s.
Author: Josh Rosen <rosenville@gmail.com>
Closes #53 from JoshRosen/make-query-into-top-level-class.
This patch adds additional constructors to `TPCDS` to maintain backwards-compatibility with code which calls `new TPCDS(anExistingSqlContext)`. This constructor was removed in #47.
The motivation for backwards-compatibility here is to simplify the gradual roll-out of an updated spark-sql-perf library to some existing jobs which share the same notebook.
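The backwards-compatibility technique here is the standard Scala auxiliary-constructor pattern. The sketch below is a simplified, hypothetical stand-in (the real `TPCDS` class takes different arguments): an auxiliary constructor delegates to the primary one with a default, so older call sites keep compiling unchanged.

```scala
// Simplified illustration of restoring a removed constructor for
// backwards compatibility (class and fields are hypothetical).
class Benchmark(val name: String, val iterations: Int) {
  // Auxiliary constructor preserving the older single-argument
  // signature; delegates to the primary constructor with a default.
  def this(name: String) = this(name, 1)
}
```

With this in place, existing code such as `new Benchmark("q1")` continues to work alongside the newer `new Benchmark("q1", 3)` form.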
Author: Josh Rosen <rosenville@gmail.com>
Closes #52 from JoshRosen/backwards-compatible-tpcds-constructor.
- Scripts for running the benchmark either while working on spark-sql-perf (bin/run) or while working on Spark (bin/spark-perf). The latter uses Spark's sbt build to compile Spark and downloads the most recent published version of spark-sql-perf.
- Adds a `--compare` flag that can be used to compare the results with a baseline run
Author: Michael Armbrust <michael@databricks.com>
Closes #49 from marmbrus/runner.
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:
```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]
  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text
$ bin/run --benchmark DatasetPerformance
```
Author: Michael Armbrust <michael@databricks.com>
Closes #47 from marmbrus/MainClass.
After this you should be able to use the library in the shell as follows:
```
bin/spark-shell --packages com.databricks:spark-sql-perf:0.2.3
```
Author: Michael Armbrust <michael@databricks.com>
Closes #46 from marmbrus/publishToMaven.