Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).
Fixes #84.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #85 from JoshRosen/fix-spark-dep.
Works around the usual scala-logging issues so that the source code cross-compiles between Scala 2.10 and Scala 2.11.
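Cross-building of this kind is typically configured in sbt along these lines. This is a minimal illustrative sketch, not the project's actual build definition; the specific versions and the per-Scala-version scala-logging artifact selection are assumptions:

```scala
// build.sbt (illustrative sketch only)
// crossScalaVersions lets `sbt +compile` / `sbt +package` build
// against both Scala binary versions.
crossScalaVersions := Seq("2.10.6", "2.11.8")

// scala-logging split its artifacts between the 2.x and 3.x lines,
// so the dependency is often chosen per Scala binary version:
libraryDependencies += (scalaBinaryVersion.value match {
  case "2.10" => "com.typesafe.scala-logging" %% "scala-logging-slf4j" % "2.1.2"
  case _      => "com.typesafe.scala-logging" %% "scala-logging"       % "3.4.0"
})
```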
Tests:
A Scala 2.11 build of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
A Scala 2.10 build has been run against the official Spark 1.6.2 release.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #81 from thunterdb/1607-scala211.
This has been tested locally with a small amount of data.
I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #76 from thunterdb/1607-als.
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.
There are 2 ways to describe and run ML pipelines:
- programmatically, in Scala (see `MLBenchmarks.scala`)
- using a simple YAML file (see `mllib-small.yaml` for an example)

The YAML approach is preferred because it programmatically generates the Cartesian product of all the experiments to run, and it validates the types of the objects in the YAML file.
In both cases, all the ML experiments are standard benchmarks.
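The Cartesian-product expansion can be sketched in plain Scala. This is a standalone illustration of the idea only; the function and parameter names here are hypothetical and are not the benchmark's actual API:

```scala
// Expand per-parameter value lists into every combination
// (the Cartesian product), yielding one Map per experiment to run.
def cartesian(params: Map[String, Seq[Any]]): Seq[Map[String, Any]] =
  params.foldLeft(Seq(Map.empty[String, Any])) {
    case (acc, (name, values)) =>
      for (m <- acc; v <- values) yield m + (name -> v)
  }

// Hypothetical parameter grid: 2 x 2 = 4 experiments.
val experiments = cartesian(Map(
  "numFeatures" -> Seq(10, 100),
  "regParam"    -> Seq(0.0, 0.1)
))
```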
This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures that confuse IntelliJ.
It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:
```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:
```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #69 from thunterdb/1605-mllib2.