Commit Graph

231 Commits

Author SHA1 Message Date
Nico Poggi
d85f75bb38
Update for Spark 3.0.0 compatibility (#191)
* Updating the build file to spark 3.0.0 and scala 2.12.10
* Fixing incompatibilities
* Adding default parameters to newer required functions
* Removing HiveTest
2020-11-03 15:27:34 +01:00
Guo Chenzhao
6b2bf9f9ad Fix files truncating according to maxRecordPerFile (#180)
* Fix files truncating according to maxRecordPerFile

* toDouble
2019-05-29 23:20:01 +08:00
Nico Poggi
3f92a094cc
Bumping version to 0.5.1-SNAPSHOT (spark 3, scala 2.12, log4j) (#168) 2019-01-29 10:00:54 +01:00
Luca Canali
e1e1365a87 Updates for Spark 3.0 and Scala 2.12 compatibility (#176)
* Refactor deprecated `getOrCreate()` in spark 3
* Compile with scala 2.12
* Updated usage related to obsolete/deprecated features
* remove use of scala-logging replaced by using slf4j directly
2019-01-29 09:58:52 +01:00
Bago Amirbekian
85bbfd4ca2 [ML-5437] Build with spark-2.4.0 and resolve build issues (#174)
We made some changes related to new APIs in Spark 2.4. Those APIs were reverted upstream because they were breaking changes, so we need to revert our changes as well.
2018-11-09 16:21:22 -08:00
Nico Poggi
d44caec277
Revert "Update Scala Logging to officially supported one" (#172)
Reverts #157 due to library errors when the previous version is already in the classpath (i.e., in Databricks), and because it did not bring any noted improvements or needed fixes. Exception:
java.lang.InstantiationError: com.typesafe.scalalogging.Logger
This reverts commit 56f7348.
2018-10-19 17:33:34 +02:00
Nico Poggi
0367ff65a6
Coalesce(n) instead of hardcoded (1) for large tables/partitions
When using `clusterByPartitionColumns`, coalesce into multiple files (instead of a hardcoded 1) when the number of records is larger than `spark.sql.files.maxRecordsPerFile`. This has the cost of a count operation, but enables multiple writers for large scale factors. It improves cluster utilization and reduces total generation time, especially for non-partitioned tables (or tables with few partitions, e.g., TPCH) and large scale factors. The cost of the count is amortized because partitioning the data is needed anyway for the multiple writers, skipping a stage.
Additionally updates the deprecated `registerTempTable` to `createOrReplaceTempView` to avoid warnings.
2018-10-16 11:21:05 +02:00
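The coalesce sizing described in this commit can be sketched as follows. This is a hypothetical stand-alone helper, not the repo's actual code, assuming the file count is the row count divided by `spark.sql.files.maxRecordsPerFile`, rounded up:

```scala
// Hypothetical sketch: choose how many files to coalesce into based on the
// row count (from a count() pass) and spark.sql.files.maxRecordsPerFile.
def numOutputFiles(rowCount: Long, maxRecordsPerFile: Long): Int = {
  if (maxRecordsPerFile <= 0) 1 // limit disabled: keep the single-file behavior
  else math.max(1, math.ceil(rowCount.toDouble / maxRecordsPerFile).toInt)
}
```

A call like `df.coalesce(numOutputFiles(df.count(), maxRecordsPerFile))` would then replace the hardcoded `coalesce(1)`.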
Nico Poggi
3c1c9e9070
Rebase for PR 87: Add -m for custom master, use SBT_HOME if set (#169)
* Add -m for custom master
* Add ability to use own sbt jar, update readme to include -m option
* Add stddev percentage showing
2018-09-17 15:18:16 +02:00
Phil
d9a41a1204 Fix 3 local benchmark classes (#165)
* Fix AggregationPerformance
* Fix JoinPerformance
* Fix average computation for datasets
* Add explanation and usage for local benchmarks
2018-09-17 14:08:56 +02:00
Nico Poggi
aac7eb54c1
Fixing TPCH DDL datatype of customer.c_nationkey from string to long (#167) 2018-09-13 12:00:29 +02:00
Piotr Mrówczyński
56f73482d7 Update Scala Logging to officially supported one 2018-09-11 12:17:06 +02:00
Nico Poggi
6136ecea6e
TPC-H datagenerator and instructions (#136)
* Adding basic partitioning to TPCH tables following VectorH paper as baseline
* Multi datagen (TPC- H and DS) and multi scale factor notebook/script.
Generates all the selected scale factors and benchmarks in one run.
* TPCH runner notebook or script for spark-shell
* Adding basic TPCH documentation
2018-09-10 23:18:33 +02:00
Nico Poggi
8bbeae664d
Adds an optional version of the SS_MAX query (#137)
* Adds an optional `ss_max` query without the count distinct to avoid a `hashAggregate` and make it more I/O intensive.

* File rename (removed underscore)

* Removing duplicate file `ss_max_b.sql` as suggested by review.
2018-09-10 22:54:02 +02:00
Nico Poggi
bf55bdb987
Make queryNames public so it can be accessed from notebooks. (#166) 2018-09-10 22:53:20 +02:00
Xiangrui Meng
bb12958874
Fix compile for Spark 2.4 SNAPSHOT and only catch NonFatal (#164)
* only catch non-fatal exceptions

* remove afterBenchmark for MLlib

* fix compile

* use Apache snapshot releases
2018-09-10 08:49:31 -07:00
Liang Zhang
0ab6bf606b Benchmark for SparkR UDF *apply() APIs
Add a benchmark for SparkR `spark.lapply, dapply/dapplyCollect, gapply/gapplyCollect` APIs. Test on synthesized data with different types and sizes.

Author: Liang Zhang <liang.zhang@databricks.com>

Closes #163 from morewood/sparkr.
2018-07-12 17:12:35 -07:00
Bago Amirbekian
8e8c08d75b [ML-4154] Added testing for before/after of ml benchmarks. (#162)
This PR adds a unit test which runs the beforeBenchmark & afterBenchmark methods for the benchmarks included in mllib-small.yaml.
2018-07-12 16:43:54 -07:00
Joseph Bradley
107495afe2
[ML-4069] Improve timing of estimators (#161)
This gives the following running times:
```
recommendation.ALS	72.083s
classification.DecisionTreeClassification	37.125s
classification.DecisionTreeClassification	33.274s
regression.DecisionTreeRegression	31.252s
regression.DecisionTreeRegression	63.35s
fpm.FPGrowth	6.219s
fpm.FPGrowth	5.342s
classification.GBTClassification	46.154s
regression.GBTRegression	45.832s
clustering.GaussianMixture	18.936s
regression.GLMRegression	20.342s
clustering.KMeans	32.473s
clustering.LDA	44.574s
clustering.LDA	24.658s
classification.LinearSVC	39.84s
regression.LinearRegression	43.335s
classification.LogisticRegression	41.637s
classification.LogisticRegression	37.711s
classification.NaiveBayes	23.351s
classification.RandomForestClassification	20.781s
regression.RandomForestRegression	39.971s
feature.Word2Vec	51.892s
```
2018-07-09 17:41:44 -07:00
Joseph Bradley
30c50dddbb [ML-2918] Call count() in default score() to improve timing of transform() (#159)
For Models and Transformers which are not tested with Evaluators, I think we are not timing transform() correctly here:

spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/mllib/MLPipelineStageBenchmarkable.scala

Line 65 in aa1587f: `transformer.transform(trainingData)`
Since transform() is lazy, we need to materialize it during timing. This PR currently just calls count() in the default implementation of score().

* call count() in score()
* changed count to UDF
2018-07-08 16:09:24 -07:00
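The materialization issue above can be illustrated with a minimal sketch. The traits here are stand-ins, not the repo's actual classes: because `transform()` is lazy, a default `score()` forces the work with `count()` so the timed span covers the real computation.

```scala
// Stand-ins for Spark's Dataset and Transformer, for illustration only.
trait Dataset { def count(): Long }
trait Transformer { def transform(data: Dataset): Dataset }

// Hypothetical default score(): count() materializes the lazy transform()
// output, so timing this call measures the actual computation.
def defaultScore(t: Transformer, data: Dataset): Double =
  t.transform(data).count().toDouble
```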
Xiangrui Meng
1798b12077
change large test timeout to 12 hours (#160) 2018-07-04 15:32:00 -07:00
Xiangrui Meng
2895ae1139 update VectorAssembler test such that the dataset size is numExamples * numFeatures (#158)
Currently the dataset size is numExamples * numFeatures * numInputCols, which is much bigger than other ML perf tests. This PR updates its implementation and makes it more efficient at slicing vectors.

Tested on the mllib-big.yaml and 3 runs finished in < 2 minutes.
2018-07-03 17:16:36 -07:00
ludatabricks
e9ef9788c2 [ML-3844] Add GBTRegression benchmark (#156)
* add GBTRegression benchmark

* add GBTRegression benchmark
2018-06-27 09:17:38 -07:00
ludatabricks
e8aa132bb8 [ML-3870] Make spark-sql-perf master compile with spark 2.3 and scala 2.11 (#155)
Change the build config to Spark 2.3 and update the Scala dependency in bin/spark-perf.
2018-06-15 06:40:14 -07:00
ludatabricks
49717a72dd put additionalTests to mlmetrics (#153)
The time for additionalTests was missing from MLMetrics. Add it back to MLMetrics so that we can measure the time for the other methods.
2018-06-13 15:21:50 -07:00
ludatabricks
a4e1c790ba [ML-3869] Make Quantilediscretizer work with spark-2.3 (#154)
Add setOutputCol for QuantileDiscretizer so that it works with Spark 2.3.

The code has been manually tested by changing the Spark version.
2018-06-13 15:19:52 -07:00
ludatabricks
51786921a6 [ML-3583] Add benchmarks to mllib-large.yaml for featurization (#152)
Benchmark for featurization is added to mllib-large.yaml.
Cannot run QuantileDiscretizer with spark 2.3. Leave this as future work:
https://databricks.atlassian.net/browse/ML-3869
2018-06-12 17:31:30 -07:00
ludatabricks
aa1587fec5 [ML-3824] Add benchmarks to mllib-large.yaml for FPGrowth (#151)
Benchmark for FPGrowth is added to mllib-large.yaml.
2018-06-12 13:10:12 -07:00
ludatabricks
6a45dc8a2d [ML-3581] Add benchmarks to mllib-large.yaml for regression (#150)
Benchmark for regression is added to mllib-large.yaml.
DecisionTreeRegression, GLMRegression, LinearRegression, and RandomForestRegression are added.

GBT, AFTSurvivalRegression, and IsotonicRegression are missing in spark-sql-perf.
2018-06-12 10:32:02 -07:00
ludatabricks
9ab2a8bb14 [ML-3585] Added benchmarks to mllib-large.yaml for clustering (#149)
Benchmark for clustering is added to mllib-large.yaml.
GaussianMixture, KMeans, and LDA are added. BisectingKMeans is missing in spark-sql-perf for now; it needs to be fixed in the follow-up JIRA: https://databricks.atlassian.net/browse/ML-3834
The parameters are based on the previous benchmarks for the Spark 2.2 QA.
2018-06-08 12:06:52 -07:00
ludatabricks
62b173d779 Output Training Time as metrics (#148)
* change the structure of mlresult and add isLargerBetter

* output training time, not scoreTrainTime
2018-06-07 13:21:32 -07:00
ludatabricks
d9984e1c0a [ML-3584] Added benchmarks to mllib-large.yaml for ALS (#147)
Benchmark for ALS is added to mllib-large.yaml.
The parameters are based on the previous benchmarks for the Spark 2.2 QA. They have been tested for stability under the same cluster setting as the other benchmarks (classification).
2018-06-07 08:11:37 -07:00
ludatabricks
93626c11b4 [ML-3775] Add "benchmarkId" to BenchmarkResult (#146)
Add "benchmarkId" to BenchmarkResult, which is based on the benchmark name and a hashed value of params.
2018-06-04 14:13:45 -07:00
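The id scheme described above could be sketched like this; `benchmarkId` is an illustrative stand-alone function, and the exact hashing and formatting in the repo may differ:

```scala
// Hedged sketch: a stable id built from the benchmark name plus a hash of
// its params, so runs of the same configuration share an id.
def benchmarkId(name: String, params: Map[String, String]): String = {
  // Sort entries so the id does not depend on map iteration order.
  val paramsHash = params.toSeq.sorted.hashCode().toHexString
  s"$name-$paramsHash"
}
```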
ludatabricks
f1139fc742 [ML-3753] Log "value" instead of "Some(value)" for ML params in results (#145)
Change the function MLParams.toMap so it does not output Option values in the map.
We will no longer get Option values in the params in the output result.

* change the "option" number to direct number
* update based on the comments
2018-06-04 11:09:41 -07:00
ludatabricks
1768d376f9 [ML-3749] Log metric name and isLargerBetter in BenchmarkResult (#144)
* Add new case class "MLMetric" to save all different metrics
* Change "mlResult" in BenchmarkResult to Array[MLMetric]
* score function will return MLMetric
2018-06-01 15:49:16 -07:00
Bago Amirbekian
789a0f5b8b Added benchmarks to mllib-large.yaml for classification Estimators. (#143) 2018-05-30 08:18:49 -07:00
WeichenXu
3786a8391e Quantile discretizer benchmark (#135)
QuantileDiscretizer benchmark
2018-05-17 11:55:00 -07:00
Bago Amirbekian
15d9283473 Run mllib small in unit tests (#141)
* Add mllib unit test to run mllib-small.yaml.

* Check results in unit tests and fail tests if failures are present.

* Rename to be MLLibSuite instead of MLLibTest.

* Better error message on failed benchmarks.
2018-05-09 16:24:30 -07:00
Bago Amirbekian
9ece11ff20 Add decision tree benchmark (#140)
* Move mllib config file to resources.

* Add DecisionTreeClassification as first benchmark in mllib-large.yaml.

* Read config files as streams to be jar compatible.

* PR feedback #140.
2018-05-08 21:44:11 -07:00
Joseph Bradley
ed9bbb01a5 fix bug with ML additional method tests (#142)
#139 introduced a bug which made most ML tests fail with mllib-small.yaml. This fixes those tests.
2018-05-08 13:23:22 -07:00
WeichenXu
be4459fe41 Additional method test for some ML algos (#139)
Add additional method test for some ML algos.

In this PR, I add `associationRules` in `FPGrowth` and `findSynonyms`. 
After the design is accepted, I will add other methods later.

Add an interface in `BenchmarkableAlgorithm`:
```
  def testAdditionalMethods(ctx: MLBenchContext, model: Transformer): Map[String, () => _]
```
2018-05-02 13:45:58 -07:00
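A sketch of how an algorithm might implement the interface above, with stand-in types (`FPGrowthModel` here is illustrative, not Spark's actual class): each map entry names an extra method and wraps it in a thunk the harness can invoke and time.

```scala
// Stand-in types for illustration; the real MLBenchContext and Transformer
// come from the repo and Spark ML.
case class MLBenchContext(name: String)
trait Transformer
case class FPGrowthModel(rules: Seq[String]) extends Transformer

// Hypothetical implementation: expose associationRules as a named thunk
// so the benchmark harness can call and time it by name.
def testAdditionalMethods(ctx: MLBenchContext, model: Transformer): Map[String, () => _] =
  model match {
    case m: FPGrowthModel => Map("associationRules" -> (() => m.rules))
    case _ => Map.empty
  }
```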
WeichenXu
5af9f6dfc2 Word2Vec benchmark (#127)
* init pr

* update

* use builtin split fun
2018-03-15 13:10:04 -07:00
Juliusz Sompolski
a8acd53fdd
Use DECIMAL and DATE in the default TPCDS notebooks. (#130)
A long time ago, DECIMAL was substituted by DOUBLE and DATE by STRING to workaround some problems.
There is no reason to do it anymore.
2018-03-07 21:44:42 +01:00
Juliusz Sompolski
b7ac7e55ae
Remove VACUUM from tpcds_datagen notebook. (#129) 2018-03-07 15:36:27 +01:00
WeichenXu
93a34553f0 MinHashLSH and BucketedRandomProjectionLSH benchmark #128
MinHashLSH and BucketedRandomProjectionLSH benchmark added.

Future questions:
* Whether we need to improve the way of testing data generation for MinHashLSH ( and add more control param, such as max/min element number in each input set )
* Whether we need to add benchmark for approxNearestNeighbors and approxSimilarityJoin
2018-03-02 15:21:37 -08:00
Bago Amirbekian
6d01ac94a1 [ML-3342] Bug fixes to make mllib benchmarks work with dbr-4.0. (#125)
In spark 2.3 some default param values were moved from Models to matching Estimators. I added explicit sets for these values in our tests to avoid errors. Also renamed ModelBuilder to ModelBuildersSSP to avoid a name conflict with dbml-local which is included in databricks runtime.
2018-03-02 09:12:38 -08:00
Juliusz Sompolski
91604a3ab0 Update README to specify that TPCDS kit needs to be installed on all nodes. 2018-02-27 12:06:12 +01:00
Juliusz Sompolski
31f34beee5
Update README to do sql("use database") (#123) 2017-11-07 20:38:26 +01:00
Juliusz Sompolski
7bf2d45b0f Don't clean blocks after every run in Benchmarkable (#119)
Do not clean blocks after each run in the generic Benchmarkable trait.
It seems to have been there since #33; an option, spark.databricks.benchmark.cleanBlocksAfter, to turn it off was added in #98, specifically so that parallel TPCDS streams would not wipe each other's blocks. But that option is quite hidden and obscure, and since it is a SparkContext config option it can only be set at cluster creation, so it is not friendly to use.

Cleaning up the blocks doesn't seem necessary for the Query Benchmarkables used for TPCDS and TPCH. Remove it from there, and leave it only for MLPipelineStageBenchmarkable.
2017-09-18 11:51:12 +02:00
Juliusz Sompolski
fdd0e38717 TPCDS notebooks in source, not binary format (#121) 2017-09-13 14:57:59 +02:00
Nico Poggi
006f096562 Merge pull request #120 from juliuszsompolski/tpcds_notebooks
Add example notebooks for running TPCDS and update readme
2017-09-12 17:22:38 +02:00