* Refactor deprecated `getOrCreate()` usage for Spark 3
* Compile with Scala 2.12
* Update usage of obsolete/deprecated features
* Remove scala-logging; use slf4j directly instead
Reverts #157 due to library errors when the previous version is already in the classpath (e.g., in Databricks), and because it did not bring any noted improvements or needed fixes. Exception:
java.lang.InstantiationError: com.typesafe.scalalogging.Logger
This reverts commit 56f7348.
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow benchmarking of both Estimators and Transformers instead of just Estimators
Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2
* Updates for running ML tests from the command line + added Naive Bayes test
* Add Bucketizer test as an example of a featurizer test; change getEstimator() to getPipelineStage() in
BenchmarkAlgorithm to allow for testing of Transformers in addition to Estimators.
* Add comment for main method in MLlib.scala
* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param
* Add UnaryTransformer trait for common data/methods to be shared across all objects
testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc.)
* Respond to review comments:
* bin/run-ml: Add newline at EOF
* Query.scala: organized imports
* MLlib.scala: organized imports, fixed SparkContext initialization
* NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
* Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver
* Fix bug in Bucketizer.scala
* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml
* Update Query.scala to use p() to access SparkPlans under a given SparkPlan
* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
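The motivation for the getEstimator() -> getPipelineStage() rename can be pictured with a minimal, self-contained sketch (no Spark dependency; the trait and class names below are simplified stand-ins for the Spark ML types, not the project's actual code): in Spark ML, both `Estimator` and `Transformer` extend `PipelineStage`, so returning the common supertype lets a benchmark wrap either one.

```scala
// Simplified stand-ins for the Spark ML type hierarchy.
trait PipelineStage { def name: String }
trait Estimator extends PipelineStage   // fitted on data to produce a model
trait Transformer extends PipelineStage // applied directly to data

case class NaiveBayes() extends Estimator { val name = "NaiveBayes" }
case class Bucketizer() extends Transformer { val name = "Bucketizer" }

trait BenchmarkAlgorithm {
  // Previously: def getEstimator(): Estimator
  // That signature could not return a Transformer such as Bucketizer.
  def getPipelineStage(): PipelineStage
}

object NaiveBayesBenchmark extends BenchmarkAlgorithm {
  def getPipelineStage(): PipelineStage = NaiveBayes()
}

object BucketizerBenchmark extends BenchmarkAlgorithm {
  def getPipelineStage(): PipelineStage = Bucketizer()
}
```

With the widened return type, the same benchmarking harness can drive an Estimator (calling fit) or a Transformer (calling transform) behind a single interface.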
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).
Fixes #84.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #85 from JoshRosen/fix-spark-dep.
Fixes the usual scala-logging issues to make the source code cross-compile between Scala 2.10 and Scala 2.11.
Tests:
A Scala 2.11 version of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
A Scala 2.10 version has been run against the official Spark 1.6.2 release.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #81 from thunterdb/1607-scala211.
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.
There are 2 ways to describe and run ML pipelines:
- programmatically, in Scala (see MLBenchmarks.scala)
- using a simple YAML file (see mllib-small.yaml for an example)
The YAML approach is preferred because it programmatically generates the cartesian product of all the experiments to run and validates the types of the objects in the YAML file.
In both cases, all the ML experiments are standard benchmarks.
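The cartesian-product expansion can be pictured with a sketch like the following. This is a hypothetical illustration only: the key names (`common`, `benchmarks`, `params`, etc.) are assumptions, not the actual schema; see mllib-small.yaml for the real format.

```yaml
# Hypothetical sketch -- key names are illustrative, not the real schema.
common:
  numPartitions: 4
  randomSeed: 42
benchmarks:
  - name: naive.bayes
    params:
      numExamples: [10000, 100000]  # 2 values ...
      numFeatures: [10, 100]        # ... x 2 values = 4 experiments
```

Each list-valued parameter contributes one axis to the cartesian product, so the single entry above expands into four distinct experiments.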
This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.
It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:
```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:
```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #69 from thunterdb/1605-mllib2.
- Scripts for running the benchmark either while working on spark-sql-perf (bin/run) or while working on Spark (bin/spark-perf). The latter uses Spark's sbt build to compile Spark and downloads the most recently published version of spark-sql-perf.
- Adds a `--compare` flag that can be used to compare the results with a baseline run
Author: Michael Armbrust <michael@databricks.com>
Closes #49 from marmbrus/runner.
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:
```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]
-b <value> | --benchmark <value>
the name of the benchmark to run
-f <value> | --filter <value>
a filter on the name of the queries to run
-i <value> | --iterations <value>
the number of iterations to run
--help
prints this usage text
$ bin/run --benchmark DatasetPerformance
```
Author: Michael Armbrust <michael@databricks.com>
Closes #47 from marmbrus/MainClass.
After this you should be able to use the library in the shell as follows:
```
bin/spark-shell --packages com.databricks:spark-sql-perf:0.2.3
```
Author: Michael Armbrust <michael@databricks.com>
Closes #46 from marmbrus/publishToMaven.
Some internal interfaces changed, so we need to bump the Spark version to run tests on Spark 1.6.
Author: Michael Armbrust <michael@databricks.com>
Closes #29 from marmbrus/spark16.