* Refactor deprecated `getOrCreate()` for Spark 3
* Compile with Scala 2.12
* Update usage of obsolete/deprecated features
* Remove use of scala-logging; use slf4j directly instead
Reverts #157 due to library errors when a previous version is already in the classpath (e.g., in Databricks), and because it brought no noted improvements or needed fixes. Exception:
java.lang.InstantiationError: com.typesafe.scalalogging.Logger
This reverts commit 56f7348.
When using `clusterByPartitionColumns`, coalesce into multiple files (instead of the hardcoded 1) when the number of records is larger than `spark.sql.files.maxRecordsPerFile`. This has the cost of a count operation, but enables multiple writers for large scale factors, improving cluster utilization and reducing total generation time, especially for non-partitioned tables or tables with few partitions (e.g., TPCH) at large scale factors. The cost of the count is amortized because partitioning the data is needed anyway for the multiple writers, skipping a stage.
Additionally, updates the deprecated `registerTempTable` to `createOrReplaceTempView` to avoid warnings.
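The file-count logic described above can be sketched in plain Scala: derive the number of output files from the record count and the per-file limit, falling back to the old single-file behavior when the limit is disabled. The object and method names here are illustrative, not the repo's actual API.

```scala
object CoalesceSketch {
  // Number of output files needed so that no file exceeds maxRecordsPerFile.
  // A non-positive limit means the limit is disabled: keep one file, as before.
  def numFiles(recordCount: Long, maxRecordsPerFile: Long): Int = {
    if (maxRecordsPerFile <= 0) 1
    else math.max(1, math.ceil(recordCount.toDouble / maxRecordsPerFile).toInt)
  }
}
```

In practice the record count would come from a `df.count()` and the limit from `spark.conf.get("spark.sql.files.maxRecordsPerFile")`; the result would then be passed to a `repartition`/`coalesce` call so multiple writers can work in parallel.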
* Adding basic partitioning to TPCH tables following VectorH paper as baseline
* Multi datagen (TPC-H and DS) and multi scale factor notebook/script.
Generates all the selected scale factors and benchmarks in one run.
* TPCH runner notebook or script for spark-shell
* Adding basic TPCH documentation
* Adds an optional `ss_max` query without the count distinct to avoid a `hashAggregate` and make it more I/O intensive.
* File rename (removed underscore)
* Removing duplicate file `ss_max_b.sql` as suggested by review.
Add a benchmark for the SparkR `spark.lapply`, `dapply`/`dapplyCollect`, and `gapply`/`gapplyCollect` APIs. Tests run on synthesized data with different types and sizes.
Author: Liang Zhang <liang.zhang@databricks.com>
Closes #163 from morewood/sparkr.
For Models and Transformers which are not tested with Evaluators, I think we are not timing transform() correctly here:
spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/mllib/MLPipelineStageBenchmarkable.scala
Line 65 in aa1587f
transformer.transform(trainingData)
Since transform() is lazy, we need to materialize it during timing. This PR currently just calls count() in the default implementation of score().
* call count() in score()
* changed count to UDF
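Because `transform()` only builds a lazy DataFrame, wrapping the call in a timer measures plan construction rather than execution; the timed body has to force evaluation, e.g. by calling `count()` on the result. A minimal, Spark-free sketch of a timing helper (the `time` name is illustrative, not the repo's actual API):

```scala
object TimingSketch {
  // Evaluate a by-name expression once, returning (result, elapsedMillis).
  def time[T](body: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }
}
```

With such a helper, timing `transformer.transform(trainingData).count()` as one body ensures the transform actually runs inside the measured interval, which is the fix this PR applies in the default `score()`.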
Currently the dataset size is numExamples * numFeatures * numInputCols, which is much bigger than in other ML perf tests. This PR updates the implementation and makes it more efficient at slicing vectors.
Tested on mllib-big.yaml; 3 runs finished in under 2 minutes.
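The slicing idea can be sketched as generating one feature array per example and cutting it into the input columns, rather than generating numFeatures fresh values per column. This is an illustrative sketch under that assumption, not the repo's actual implementation:

```scala
object VectorSliceSketch {
  // Slice one feature array into numCols contiguous, equal-width columns,
  // so total generated data is numExamples * numFeatures, not * numInputCols.
  def sliceIntoCols(features: Array[Double], numCols: Int): Seq[Array[Double]] = {
    require(numCols > 0 && features.length % numCols == 0,
      "feature length must be a positive multiple of numCols")
    val width = features.length / numCols
    (0 until numCols).map(i => features.slice(i * width, (i + 1) * width))
  }
}
```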
Benchmark for regression is added to mllib-large.yaml.
DecisionTreeRegression, GLMRegression, LinearRegression, and RandomForestRegression are added.
GBT, AFTSurvivalRegression, and IsotonicRegression are missing in spark-sql-perf.
Benchmark for clustering is added to mllib-large.yaml.
GaussianMixture, KMeans, and LDA are added. BisectingKMeans is currently missing in spark-sql-perf; to be fixed in the follow-up JIRA: https://databricks.atlassian.net/browse/ML-3834
The parameters are based on the previous benchmarks for the Spark 2.2 QA.
Benchmark for ALS is added to mllib-large.yaml.
The parameters are based on the previous benchmarks for the Spark 2.2 QA. They have been tested for stability under the same cluster settings as the other benchmarks (classification).
Change the function `MLParams.toMap` so it does not output `Option` values in the map.
We will no longer get `Option` values in the params of the output result.
* Change `Option`-wrapped numbers to direct numbers
* update based on the comments
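The `Option`-unwrapping behavior can be sketched in plain Scala: drop `None` entries and unwrap `Some(v)` so the emitted map holds plain values. The object and method names are illustrative, not the actual `MLParams.toMap` implementation:

```scala
object ParamsSketch {
  // Drop None entries and unwrap Some(v); pass non-Option values through,
  // so the output map never contains Option values.
  def flattenOptions(params: Map[String, Any]): Map[String, Any] =
    params.collect {
      case (k, Some(v))                          => k -> v
      case (k, v) if !v.isInstanceOf[Option[_]]  => k -> v
    }
}
```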
* Add new case class "MLMetric" to save all different metrics
* Change "mlResult" in BenchmarkResult to Array[MLMetric]
* score function will return MLMetric
* Add mllib unit test to run mllib-small.yaml.
* Check results in unit tests and fail tests if failures are present.
* Rename to be MLLibSuite instead of MLLibTest.
* Better error message on failed benchmarks.
* Move mllib config file to resources.
* Add DecisionTreeClassification as first benchmark in mllib-large.yaml.
* Read config files as streams to be jar compatible.
* PR feedback #140.
Add additional method test for some ML algos.
In this PR, I add `associationRules` in `FPGrowth` and `findSynonyms`.
After the design is accepted, I will add other methods later.
Add an interface in `BenchmarkableAlgorithm`:
```
def testAdditionalMethods(ctx: MLBenchContext, model: Transformer): Map[String, () => _]
```
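The `Map[String, () => _]` shape lets the harness look up each extra method by name and invoke (and time) it as a thunk. A minimal, Spark-free sketch of the pattern, with purely illustrative names and string stand-ins for real model calls:

```scala
object AdditionalMethodsSketch {
  // Each entry maps a method name to a thunk the harness can invoke and time.
  // Real implementations would call e.g. model.associationRules here.
  def methodsFor(model: String): Map[String, () => Any] = Map(
    "associationRules" -> (() => s"$model.associationRules"),
    "findSynonyms"     -> (() => s"$model.findSynonyms")
  )

  // Invoke every registered method, keyed by name.
  def runAll(methods: Map[String, () => Any]): Map[String, Any] =
    methods.map { case (name, thunk) => name -> thunk() }
}
```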
MinHashLSH and BucketedRandomProjectionLSH benchmark added.
Future questions:
* Whether we need to improve the test data generation for MinHashLSH (and add more control params, such as the max/min number of elements in each input set)
* Whether we need to add benchmarks for `approxNearestNeighbors` and `approxSimilarityJoin`
In Spark 2.3, some default param values were moved from Models to the matching Estimators. I added explicit sets for these values in our tests to avoid errors. Also renamed ModelBuilder to ModelBuildersSSP to avoid a name conflict with dbml-local, which is included in the Databricks runtime.
Do not clean blocks after each run in the generic Benchmarkable trait.
It seems to have been there since #33; an option `spark.databricks.benchmark.cleanBlocksAfter` to turn it off was added in #98, specifically to allow parallel TPCDS streams to not wipe each other's blocks. But that option is quite hidden and obscure, and as a SparkContext config option it can only be set at cluster creation, so it is not friendly to use.
Cleaning up the blocks doesn't seem necessary for the Query Benchmarkables used for TPCDS and TPCH. Remove it from there, and leave it only for MLPipelineStageBenchmarkable.