## What changes were proposed in this pull request?
Since SPARK-20690 and SPARK-20916, Spark requires every subquery in a FROM clause to have an alias.
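For example, a query of the following shape, which previously ran without an alias, now needs one (illustrative query, not taken from the benchmark sources):

```sql
-- Rejected since SPARK-20690 / SPARK-20916: the subquery has no alias.
SELECT * FROM (SELECT 1);

-- Accepted: the subquery in the FROM clause carries the alias `t`.
SELECT * FROM (SELECT 1) AS t;
```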
## How was this patch tested?
Tested at scale factor 1 (SF1).
Data generation:
* Add an option to convert Dates to Strings, and specify it in the Tables object creator.
* Add partition discovery to `createExternalTables`.
* Add an `analyzeTables` function that gathers table statistics.
Benchmark execution:
* Perform `collect()` on the DataFrame so that the query is recorded in the SQL tab of the Spark UI.
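The benchmark-execution change above amounts to forcing an action on the query. A minimal sketch, assuming a running `SparkSession` named `spark` and an illustrative table name (neither is from the patch itself):

```scala
// Queries are lazy until an action runs; collect() forces execution so the
// query appears in the SQL tab of the Spark UI.
val df = spark.sql("SELECT COUNT(*) FROM store_sales")
df.collect()
```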
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).
Fixes #84.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #85 from JoshRosen/fix-spark-dep.
Fixes the usual scala-logging issues so that the source code cross-compiles between Scala 2.10 and Scala 2.11.
Tests:
A Scala 2.11 build of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
A Scala 2.10 build has been run against the official Spark 1.6.2 release.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #81 from thunterdb/1607-scala211.
This has been tested locally with a small amount of data.
I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.
Author: Timothy Hunter <timhunter@databricks.com>
Closes #76 from thunterdb/1607-als.
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.
There are 2 ways to describe and run ML pipelines:
- programmatically, in Scala (see `MLBenchmarks.scala`)
- using a simple YAML file (see `mllib-small.yaml` for an example)
The YAML approach is preferred: it programmatically generates the cartesian product of all the experiments to run, and it validates the types of the objects in the YAML file.
In both cases, all the ML experiments are standard benchmarks.
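The cartesian-product expansion mentioned above can be sketched as follows; this is a hypothetical helper, not the actual spark-sql-perf implementation, and the parameter names are illustrative:

```scala
// Hypothetical sketch of expanding a parameter grid (as might be parsed from
// a YAML config) into the cartesian product of all experiments to run.
object ExperimentExpansion {
  // Each key maps to the list of values to try for that parameter.
  type ParamGrid = Map[String, Seq[Any]]

  // Expand a grid into one parameter map per concrete experiment.
  def cartesian(grid: ParamGrid): Seq[Map[String, Any]] =
    grid.foldLeft(Seq(Map.empty[String, Any])) { case (acc, (key, values)) =>
      for (m <- acc; v <- values) yield m + (key -> v)
    }
}
```

For instance, a grid with two values for `numPartitions` and two for `regParam` expands into four experiments.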
This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.
It does not include tests, but some small benchmarks can be run locally against a Spark 2.x installation:
```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:
```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #69 from thunterdb/1605-mllib2.