When using `clusterByPartitionColumns`, coalesce into multiple files (instead of a hardcoded 1) when the number of records is larger than `spark.sql.files.maxRecordsPerFile`. This has the cost of a count operation, but enables multiple writers for large scale factors, which improves cluster utilization and reduces total generation time, especially for non-partitioned tables (or tables with few partitions, e.g., TPCH) and large scale factors. The cost of the count is amortized because partitioning the data is needed anyway for the multiple writers, which skips a stage.
Additionally, updates the deprecated `registerTempTable` to `createOrReplaceTempView` to avoid warnings.
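The file-count computation described above can be sketched as plain arithmetic (a minimal illustration with an invented helper name; the actual change lives in the Scala table-generation code):

```python
def num_output_files(record_count, max_records_per_file):
    """Hypothetical helper: how many files to coalesce into.

    One file while the data fits under the limit, otherwise enough
    files that none exceeds max_records_per_file, so several writers
    can work in parallel instead of a single hardcoded coalesce(1).
    """
    if max_records_per_file <= 0:  # limit disabled
        return 1
    # ceiling division without floats
    return max(1, -(-record_count // max_records_per_file))

# A table with 2.5M records and a 1M-records-per-file limit is
# written by 3 parallel writers instead of 1.
print(num_output_files(2_500_000, 1_000_000))  # 3
```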
* Adding basic partitioning to TPCH tables, following the VectorH paper as a baseline
* Multi datagen (TPC-H and TPC-DS) and multi-scale-factor notebook/script.
Generates all the selected scale factors and benchmarks in one run.
* TPCH runner notebook or script for spark-shell
* Adding basic TPCH documentation
* Adds an optional `ss_max` query without the count distinct to avoid a `hashAggregate` and make it more I/O intensive.
* File rename (removed underscore)
* Removing duplicate file `ss_max_b.sql` as suggested by review.
Add a benchmark for the SparkR `spark.lapply`, `dapply`/`dapplyCollect`, and `gapply`/`gapplyCollect` APIs. Tested on synthesized data with different types and sizes.
Author: Liang Zhang <liang.zhang@databricks.com>
Closes #163 from morewood/sparkr.
For Models and Transformers which are not tested with Evaluators, I think we are not timing transform() correctly here:
spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/mllib/MLPipelineStageBenchmarkable.scala
Line 65 in aa1587f
transformer.transform(trainingData)
Since transform() is lazy, we need to materialize it during timing. This PR currently just calls count() in the default implementation of score().
* call count() in score()
* changed count to UDF
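The laziness pitfall can be illustrated outside Spark with a generator (a sketch, not the PR's code: `lazy_transform` and `expensive` are made-up stand-ins for `transformer.transform` and the per-row work; `sum(1 for ...)` plays the role of `count()`):

```python
import time

def expensive(r):
    # Stand-in for per-row transform work.
    time.sleep(0.001)
    return r * 2

def lazy_transform(rows):
    # Like DataFrame.transform, this returns lazily:
    # no work happens until the result is consumed.
    return (expensive(r) for r in rows)

rows = list(range(100))

# Wrong: timing only the call to the lazy transform measures ~nothing.
start = time.perf_counter()
result = lazy_transform(rows)  # nothing has executed yet
untimed = time.perf_counter() - start

# Right: materialize (the analogue of count()) inside the timed region.
start = time.perf_counter()
n = sum(1 for _ in lazy_transform(rows))  # forces all rows to be computed
timed = time.perf_counter() - start

assert timed > untimed  # the real work only shows up when materialized
```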
Currently the dataset size is numExamples * numFeatures * numInputCols, which is much bigger than in other ML perf tests. This PR updates the implementation and makes vector slicing more efficient.
Tested on mllib-big.yaml; 3 runs finished in under 2 minutes.
Benchmark for regression is added to mllib-large.yaml.
DecisionTreeRegression, GLMRegression, LinearRegression, and RandomForestRegression are added.
GBT, AFTSurvivalRegression, and IsotonicRegression are missing in spark-sql-perf.
Benchmark for clustering is added to mllib-large.yaml.
GaussianMixture, KMeans, and LDA are added. BisectingKMeans is missing in spark-sql-perf now; it needs to be fixed in the follow-up JIRA: https://databricks.atlassian.net/browse/ML-3834
The parameters are based on the previous benchmarks for the Spark 2.2 QA.
Benchmark for ALS is added to mllib-large.yaml.
The parameters are based on the previous benchmarks for the Spark 2.2 QA. Stability has been tested under the same cluster settings used for the other benchmarks (classification).
Change the function MLParams.toMap so it will not output an Option value in the map.
We will no longer get Option values among the params in the output result.
* changed the Option-wrapped numbers to plain numbers
* update based on the comments
* Add new case class "MLMetric" to save all different metrics
* Change "mlResult" in BenchmarkResult to Array[MLMetric]
* score function will return MLMetric
* Add mllib unit test to run mllib-small.yaml.
* Check results in unit tests and fail tests if failures are present.
* Rename to be MLLibSuite instead of MLLibTest.
* Better error message on failed benchmarks.
* Move mllib config file to resources.
* Add DecisionTreeClassification as first benchmark in mllib-large.yaml.
* Read config files as streams to be jar compatible.
* PR feedback #140.
Add tests for additional methods of some ML algorithms.
In this PR, I add `associationRules` (in `FPGrowth`) and `findSynonyms`.
After the design is accepted, I will add other methods later.
Add an interface in `BenchmarkableAlgorithm`:
```
def testAdditionalMethods(ctx: MLBenchContext, model: Transformer): Map[String, () => _]
```
MinHashLSH and BucketedRandomProjectionLSH benchmark added.
Future questions:
* Whether we need to improve the test data generation for MinHashLSH (and add more control params, such as the max/min element count per input set)
* Whether we need to add benchmark for approxNearestNeighbors and approxSimilarityJoin
In Spark 2.3 some default param values were moved from Models to the matching Estimators. I added explicit sets for these values in our tests to avoid errors. Also renamed ModelBuilder to ModelBuildersSSP to avoid a name conflict with dbml-local, which is included in the Databricks runtime.
Do not clean blocks after each run in the generic Benchmarkable trait.
It seems to have been there since #33, and an option `spark.databricks.benchmark.cleanBlocksAfter` to turn it off was added in #98, specifically to keep parallel TPCDS streams from wiping each other's blocks. But that option is well hidden and obscure, and as a SparkContext config option it can only be set at cluster creation, so it is not friendly to use.
Cleaning up the blocks doesn't seem necessary for the Query Benchmarkables used for TPCDS and TPCH. Remove it from there, and leave it only for MLPipelineStageBenchmarkable.
Add an `ItemSetGenerator` class, using the following algorithm:
1. Create P = `numItems` items (integers 0 to P-1).
2. Generate `numExample` rows, where each row (an itemset) is built as follows:
   2.1. Choose the size of the itemset from a Poisson distribution.
   2.2. Generate `size - 2` items by drawing integers from a Poisson distribution, eliminating duplicates as needed.
   2.3. Add 2 new items in order to create actual association rules:
      2.3.1. Pick the first item and compute a new item = (firstItem + P / 2) % P; add it to the set.
      2.3.2. Pick the first 2 items (integers) and add them together (modulo P) to compute a new item to add to the set.
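The per-row steps above can be sketched in pure Python (a hypothetical port: the actual `ItemSetGenerator` is Scala, and the Poisson means used here for steps 2.1 and 2.2 are assumptions, not values from the PR):

```python
import math
import random

def poisson(lam, rng):
    # Knuth's Poisson sampler; fine for the small means used here.
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def gen_itemset(num_items, avg_size, rng):
    """One row (itemset) of the generator; P = num_items."""
    size = max(3, poisson(avg_size, rng))            # step 2.1 (clamped)
    items = []
    while len(items) < size - 2:                     # step 2.2, dedup
        # mean P/2 is an assumption; the spec leaves it open
        item = poisson(num_items / 2, rng) % num_items
        if item not in items:
            items.append(item)
    items.append((items[0] + num_items // 2) % num_items)  # step 2.3.1
    items.append((items[0] + items[1]) % num_items)        # step 2.3.2
    return items

rng = random.Random(42)
row = gen_itemset(100, 5, rng)
```

Because the last two items are deterministic functions of the first one(s), frequent-pattern miners can recover real association rules from the otherwise random sets.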
In #109 coalescing of non-partitioned tables into 1 file seems to have gotten accidentally removed.
Put it back, but only when `clusterByPartitionedColumns == true`.
Considering that we coalesce partitions only when that setting is true, it seems consistent to use it for non-partitioned tables as well.
It may be better to change the name of the parameter, but that changes the interface, and possibly should be left for some future clean up.
This PR follows up on #112, adding new performance tests for DecisionTreeRegression, RandomForestRegression, GMM, and HashingTF.
Summary of changes:
* Added new performance tests
* Updated configs in mllib-small.yaml
** Alphabetized configs
** Added new configs for: RandomForestRegression, DecisionTreeRegression, GMM, HashingTF
* Refactored TreeOrForestClassification into a trait (TreeOrForestEstimator) exposing methods for all tree/forest estimator performance tests.
** Copied code from DecisionTreeClassification.scala into TreeOrForestEstimator.scala
I tested this PR by running the performance tests specified in mllib-small.yaml
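The refactoring pattern described above, pulling shared tree/forest setup into one reusable unit, can be sketched as a Python mixin (an analogy with invented attribute names; the real code is the Scala trait `TreeOrForestEstimator`):

```python
class TreeOrForestEstimator:
    """Analogue of the TreeOrForestEstimator trait: logic shared by
    all tree/forest perf tests lives here instead of being copied
    into each benchmark class. Attribute names are illustrative."""
    def tree_params(self):
        return {"maxDepth": self.max_depth, "maxBins": self.max_bins}

class DecisionTreeClassification(TreeOrForestEstimator):
    max_depth, max_bins = 5, 32

class RandomForestRegression(TreeOrForestEstimator):
    max_depth, max_bins = 10, 64

# Both benchmarks reuse the shared method instead of duplicating it.
print(DecisionTreeClassification().tree_params()["maxDepth"])  # 5
print(RandomForestRegression().tree_params()["maxBins"])       # 64
```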