Commit Graph

34 Commits

Author SHA1 Message Date
Yuming Wang
65785a8a04
Fix Travis CI JDK installation (#195)
* Replace oraclejdk8 with openjdk8
* Update .travis.yml
2021-01-28 17:28:46 +01:00
Nico Poggi
d85f75bb38
Update for Spark 3.0.0 compatibility (#191)
* Updating the build file to Spark 3.0.0 and Scala 2.12.10
* Fixing incompatibilities
* Adding default parameters to newer required functions
* Removing HiveTest
2020-11-03 15:27:34 +01:00
Luca Canali
e1e1365a87 Updates for Spark 3.0 and Scala 2.12 compatibility (#176)
* Refactor deprecated `getOrCreate()` in Spark 3
* Compile with Scala 2.12
* Updated usage related to obsolete/deprecated features
* Remove use of scala-logging, replacing it by using slf4j directly
2019-01-29 09:58:52 +01:00
Bago Amirbekian
85bbfd4ca2 [ML-5437] Build with spark-2.4.0 and resolve build issues (#174)
We made some changes related to new APIs in Spark 2.4. Those APIs were reverted upstream because they were breaking changes, so we need to revert our changes as well.
2018-11-09 16:21:22 -08:00
Nico Poggi
d44caec277
Revert "Update Scala Logging to officially supported one " (#172)
Reverts #157 due to library errors when the previous version is already in the classpath (i.e., in Databricks), and because it did not bring any noted improvements or needed fixes. Exception:
java.lang.InstantiationError: com.typesafe.scalalogging.Logger
This reverts commit 56f7348.
2018-10-19 17:33:34 +02:00
Piotr Mrówczyński
56f73482d7 Update Scala Logging to officially supported one 2018-09-11 12:17:06 +02:00
Xiangrui Meng
bb12958874
Fix compile for Spark 2.4 SNAPSHOT and only catch NonFatal (#164)
* only catch non-fatal exceptions

* remove afterBenchmark for MLlib

* fix compile

* use Apache snapshot releases
2018-09-10 08:49:31 -07:00
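The "only catch NonFatal" change above refers to Scala's standard idiom for catching recoverable exceptions while letting fatal errors (OutOfMemoryError, etc.) propagate. A minimal sketch of the pattern; the `runSafely` helper is hypothetical, not part of spark-sql-perf:

```scala
import scala.util.control.NonFatal

// Hypothetical helper: NonFatal matches ordinary exceptions,
// but fatal JVM errors still propagate instead of being swallowed.
def runSafely[T](body: => T): Option[T] =
  try Some(body)
  catch {
    case NonFatal(e) =>
      println(s"Benchmark step failed: ${e.getMessage}")
      None
  }
```

Catching `Throwable` in a benchmark harness would hide fatal errors and corrupt results, which is why this commit narrows the catch clause.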
ludatabricks
e8aa132bb8 [ML-3870] Make spark-sql-perf master compiled with spark 2.3 and scala 2.11 (#155)
Changes the build config to Spark 2.3 and updates the Scala dependency in bin/spark-perf.
2018-06-15 06:40:14 -07:00
Siddharth Murching
d0de5ae8aa Update tests to run with Spark 2.2, add NaiveBayes & Bucketizer ML tests (#110)
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow benchmarking of both Estimators and Transformers, instead of just Estimators

Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2

* Updates for running ML tests from the command line + added Naive Bayes test

* Add Bucketizer test as an example of a Featurizer test; change getEstimator() to getPipelineStage() in BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.

* Add comment for main method in MLlib.scala

* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param

* Add UnaryTransformer trait for common data/methods to be shared across all objects testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc.)

* Respond to review comments:

* bin/run-ml: Add newline at EOF
* Query.scala: organized imports
* MLlib.scala: organized imports, fixed SparkContext initialization
* NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
* Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver

* Fix bug in Bucketizer.scala

* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml

* Update Query.scala to use p() to access SparkPlans under a given SparkPlan

* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
2017-08-21 15:07:46 -07:00
Eric Liang
64728c7cff Add option to avoid cleaning after each run, to enable parallel runs 2017-03-14 19:45:27 -07:00
Timothy Hunter
53091a1935 Removes labels from tree data generation (#82)
* changes

* removes labels

* reset scala version

* adding metadata

* bumping spark release
2016-12-13 16:47:31 -08:00
srinathshankar
685c50d9dc Cross build with Scala 2.11 (#91)
* Cross build with Scala 2.11

* Update snapshot version
2016-10-03 17:01:17 -07:00
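Cross-building for Scala 2.10 and 2.11, as this commit does, is typically configured in sbt via `crossScalaVersions`. A minimal build.sbt sketch of the idea (the version numbers are illustrative, not the project's exact settings):

```scala
// build.sbt (illustrative): prefixing a task with "+" (e.g. `+compile`,
// `+publish`) runs it once per listed Scala version.
scalaVersion := "2.10.6"
crossScalaVersions := Seq("2.10.6", "2.11.8")
```

With this in place, `sbt +package` produces artifacts suffixed `_2.10` and `_2.11`, which is what lets the same repo serve both Spark 1.6 (Scala 2.10) and Spark 2.0 (Scala 2.11) builds.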
Josh Rosen
c2224f37e5 Depend on non-snapshot Spark now that 2.0.0 is released
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).

Fixes #84.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #85 from JoshRosen/fix-spark-dep.
2016-08-17 17:53:30 -07:00
Timothy Hunter
948c8369e7 Fixes issues with scala 2.11
Fixes the usual scala-logging issues so that the source code cross-compiles between Scala 2.10 and Scala 2.11.

Tests:
A Scala 2.11 build of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
A Scala 2.10 build has been run against the official Spark 1.6.2 release.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #81 from thunterdb/1607-scala211.
2016-07-19 11:19:52 -07:00
Timothy Hunter
1388722b81 Initial commit for adding MLlib reporting in spark-sql-perf
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.

There are 2 ways to describe and run ML pipelines:
 - programmatically, in Scala (see MLBenchmarks.scala)
 - using a simple YAML file (see mllib-small.yaml for an example)
The YAML approach is preferred because it programmatically generates the Cartesian product of all the experiments to run, and it validates the types of the objects in the YAML file.

In both cases, all the ML experiments are standard benchmarks.

This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.

It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:

```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:

```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #69 from thunterdb/1605-mllib2.
2016-06-22 16:59:49 -07:00
Josh Rosen
7e38b77c50 Update to compile against Spark 2.0.0-SNAPSHOT and bump version to 0.4.0-SNAPSHOT
Author: Josh Rosen <rosenville@gmail.com>

Closes #51 from JoshRosen/spark-2.0.0.
2016-02-19 13:02:29 -08:00
Michael Armbrust
9d3347e949 Improvements to running the benchmark
- Scripts for running the benchmark either while working on spark-sql-perf (bin/run) or while working on Spark (bin/spark-perf). The latter uses Spark's sbt build to compile Spark and downloads the most recent published version of spark-sql-perf.
- Adds a `--compare` flag that can be used to compare the results with a baseline run

Author: Michael Armbrust <michael@databricks.com>

Closes #49 from marmbrus/runner.
2016-01-24 20:24:54 -08:00
Michael Armbrust
43f7457d03 Add required developer info to pom 2016-01-19 13:03:31 -08:00
Michael Armbrust
9afabf249a remove sql dependency 2016-01-19 12:52:03 -08:00
Michael Armbrust
663ca7560e Main Class for running Benchmarks from the command line
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:

```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text

$ bin/run --benchmark DatasetPerformance
```

Author: Michael Armbrust <michael@databricks.com>

Closes #47 from marmbrus/MainClass.
2016-01-19 12:37:51 -08:00
Michael Armbrust
5c93fff323 Upgrade to 1.6
Author: Michael Armbrust <michael@databricks.com>

Closes #48 from marmbrus/upgrade.
2016-01-18 09:11:35 -08:00
Michael Armbrust
7825449eef Include publishing to BinTray in release process
After this you should be able to use the library in the shell as follows:

```
bin/spark-shell --packages com.databricks:spark-sql-perf:0.2.3
```

Author: Michael Armbrust <michael@databricks.com>

Closes #46 from marmbrus/publishToMaven.
2015-12-23 00:09:35 -08:00
Michael Armbrust
f8aa93d968 Initial set of tests for Datasets
Author: Michael Armbrust <michael@databricks.com>

Closes #42 from marmbrus/dataset-tests.
2015-12-08 16:04:42 -08:00
Michael Armbrust
e516e1e7b3 Use published preview release of 1.6
Author: Michael Armbrust <michael@databricks.com>

Closes #32 from marmbrus/spark16.
2015-11-16 22:46:36 -08:00
Michael Armbrust
344b31ed69 Update to Spark 1.6
Some internal interfaces changed, so we need to bump the Spark version to run tests on Spark 1.6.

Author: Michael Armbrust <michael@databricks.com>

Closes #29 from marmbrus/spark16.
2015-11-13 12:40:00 -08:00
Michael Armbrust
8b441c1ee2 Update build.sbt 2015-09-11 12:16:55 -07:00
Michael Armbrust
479e4081c2 Add a release process for pushing to DBC 2015-09-09 22:32:31 -07:00
Michael Armbrust
e046705e7f update version 2015-08-24 16:14:17 -07:00
Michael Armbrust
98dd76befd Release 0.1.1 2015-08-24 16:13:51 -07:00
Michael Armbrust
e5ac7f6b4a update version 0.1.1-SNAPSHOT 2015-08-23 13:45:01 -07:00
Michael Armbrust
cabbf7291c release 0.1 2015-08-23 13:44:23 -07:00
Michael Armbrust
00aa49e8e4 Add support for CPU Profiling. 2015-08-20 16:46:12 -07:00
Michael Armbrust
eba8cea93c Basic join performance tests 2015-07-13 16:20:36 -07:00
Yin Huai
930751810e Initial port. 2015-04-15 20:03:14 -07:00