Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).
Fixes #84.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #85 from JoshRosen/fix-spark-dep.
Fixes the usual scala-logging issues so that the source code cross-compiles against Scala 2.10 and Scala 2.11.
Tests:
- A Scala 2.11 build of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
- A Scala 2.10 build has been run against the official Spark 1.6.2 release.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #81 from thunterdb/1607-scala211.
This has been tested locally with a small amount of data.
I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #76 from thunterdb/1607-als.
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.
There are 2 ways to describe and run ML pipelines:
- programmatically, in Scala (see `MLBenchmarks.scala`)
- using a simple YAML file (see `mllib-small.yaml` for an example)
The YAML approach is preferred because it programmatically generates the cartesian product of all the experiments to run and validates the types of the objects in the YAML file.
In both cases, all the ML experiments are standard benchmarks.
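As a rough sketch of what "generating the cartesian product of all the experiments" means here: given a grid of parameter values, each combination becomes one experiment. The parameter names below (`numFeatures`, `regParam`) are illustrative only and not taken from the actual config schema.

```scala
object GridDemo {
  // Expand a parameter grid into the cartesian product of concrete
  // experiment configurations, preserving the declared key order.
  def cartesian(params: List[(String, Seq[Any])]): Seq[Map[String, Any]] =
    params.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (key, values)) =>
      for (partial <- acc; v <- values) yield partial + (key -> v)
    }

  def main(args: Array[String]): Unit = {
    val grid = List("numFeatures" -> Seq(10, 100), "regParam" -> Seq(0.0, 0.1))
    // 2 values x 2 values => 4 experiments
    println(cartesian(grid).size) // 4
  }
}
```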
This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures, which confuses IntelliJ.
It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:
```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:
```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #69 from thunterdb/1605-mllib2.
This patch extracts `Query` into its own top-level class and makes its `sparkContext` field transient in order to fix `NotSerializableException`s.
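A minimal sketch of the pattern being applied (the `Context` stand-in and the simplified `Query` shape below are illustrative, not the actual spark-sql-perf classes): marking a non-serializable handle `@transient` lets Java serialization skip it instead of throwing `NotSerializableException`.

```scala
import java.io._

// Stand-in for SparkContext: a handle that is not Serializable.
class Context

// The context field is @transient, so serializing a Query no longer
// drags the non-serializable handle along with it.
class Query(@transient val ctx: Context, val name: String) extends Serializable

object TransientDemo {
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out   = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Without @transient this would throw NotSerializableException.
    val payload = serialize(new Query(new Context, "q1"))
    println(payload.nonEmpty)
  }
}
```

On deserialization the transient field comes back as `null`, which is acceptable here because the context is only needed on the driver.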
Author: Josh Rosen <rosenville@gmail.com>
Closes #53 from JoshRosen/make-query-into-top-level-class.
This patch adds additional constructors to `TPCDS` to maintain backwards-compatibility with code which calls `new TPCDS(anExistingSqlContext)`. This constructor was removed in #47.
The motivation for backwards-compatibility here is to simplify the gradual roll-out of an updated spark-sql-perf library to some existing jobs which share the same notebook.
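The shape of the fix is the standard Scala auxiliary-constructor pattern. The stand-in types below are illustrative (they are not the actual Spark or spark-sql-perf classes, and the real wiring between the old and new context types may differ):

```scala
// Stand-ins for the old and new context types.
class SQLContextLike
class SparkSessionLike(val sqlContext: SQLContextLike)

// The primary constructor takes the new type; an auxiliary constructor
// keeps the old `new TPCDS(anExistingSqlContext)` call sites compiling.
class TPCDSLike(val session: SparkSessionLike) {
  def this(sqlContext: SQLContextLike) =
    this(new SparkSessionLike(sqlContext))
}
```

Callers built against the old API continue to work unchanged, while new code can pass the session type directly.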
Author: Josh Rosen <rosenville@gmail.com>
Closes #52 from JoshRosen/backwards-compatible-tpcds-constructor.