spark-sql-perf

Author	SHA1	Message	Date
Nico Poggi	006f096562	Merge pull request #120 from juliuszsompolski/tpcds_notebooks Add example notebooks for running TPCDS and update readme	2017-09-12 17:22:38 +02:00
Juliusz Sompolski	5ebb9cfb12	add some more comments	2017-09-12 16:51:26 +02:00
Juliusz Sompolski	c78f2b3a9b	update readme	2017-09-12 16:40:23 +02:00
Juliusz Sompolski	ae8bcdb292	add notebooks	2017-09-12 15:43:08 +02:00
WeichenXu	f08bf31d18	add benchmark for FPGrowth (#113) Note: Add an `ItemSetGenerator` class that uses the following algorithm: 1. Create P=`numItems` items (integers 0 to P-1). 2. Generate `numExample` rows, where each row (an itemset) is selected as follows: 2.1 Choose the size of the itemset from a Poisson distribution. 2.2 Generate `size - 2` items by choosing integers from a Poisson distribution. Eliminate duplicates as needed. 2.3 Add 2 new items in order to create actual association rules. 2.3.1 For each itemset, pick the first item, compute a new item = (firstItem + P / 2) % P, and add the new item to the set. 2.3.2 For each itemset, pick the first 2 items (integers) and add them together (modulo P) to compute a new item to add to the set.	2017-09-04 10:48:05 -07:00
Juliusz Sompolski	bcda8fc1e5	Coalesce non-partitioned tables. (#118) In #109 coalescing of non-partitioned tables into 1 file seems to have gotten accidentally removed. Put it back, but only when clusterByPartitionedColumns == true. Considering that we coalesce partitions only when that setting is true, it seems consistent to use it also for non-partitioned tables. It may be better to change the name of the parameter, but that changes the interface, and possibly should be left for some future cleanup.	2017-09-04 18:05:42 +02:00
Siddharth Murching	3e1bbd00ed	[ML-2847] Add new tests for (DecisionTree, RandomForest)Regression, GMM, HashingTF (#116) This PR follows up on #112, adding new performance tests for DecisionTreeRegression, RandomForestRegression, GMM, and HashingTF. Summary of changes: * Added new performance tests * Updated configs in mllib-small.yaml: alphabetized configs; added new configs for RandomForestRegression, DecisionTreeRegression, GMM, HashingTF * Refactored TreeOrForestClassification into a trait (TreeOrForestEstimator) exposing methods for all tree/forest estimator performance tests. ** Copied code from DecisionTreeClassification.scala into TreeOrForestEstimator.scala. I tested this PR by running the performance tests specified in mllib-small.yaml.	2017-09-03 22:26:20 -07:00
WeichenXu	19c41464c7	fix df.drop in VectorAssembler (#117) fix df.drop in VectorAssembler to return correct DataFrame	2017-09-01 13:51:05 -07:00
WeichenXu	6ec83fd0f7	Add benchmark for LinearSVC/OneHotEncoder/VectorSlicer/VectorAssembler/StringIndexer/Tokenizer (#112) Add benchmark for: LinearSVC, OneHotEncoder, VectorSlicer, VectorAssembler, StringIndexer, Tokenizer.	2017-08-31 13:56:43 -07:00
Juliusz Sompolski	737a1bc355	BlockingLineStream (#115) ## What changes are proposed in this pull request? Investigating OOMs during TPCDS data generation: it turned out that the Scala standard library's ProcessBuilder.lineStream would by default create a LinkedBlockingQueue buffer of Integer.MAX_VALUE capacity. It surfaced after we implemented 10x improvements to dsdgen speed in https://github.com/databricks/tpcds-kit/pull/2. Now spark-sql-perf does not keep up with ingesting data from dsdgen, and the buffer will cause OOMs. Pulled out pieces of ProcessBuilderImpl and ProcessImpl just to create a LinkedBlockingQueue with maxQueueSize=65536 instead. Also submitted https://github.com/scala/scala/pull/6052 ## How was this patch tested? - ssh on the worker - see that dsdgen is being throttled now, Java memory doesn't explode. - tested that TPCDS SF100 generated correctly.	2017-08-31 15:16:22 +02:00
Siddharth Murching	9febc34f66	Refactor MLParams for spark-sql-perf (#114) A case class (MLParams) is currently used to store/access parameters for ML tests in spark-sql-perf. With the addition of new ML tests to spark-sql-perf (in this PR: #112), the number of ML-related test params will be > 22, but Scala only allows up to 22 params in a case class. This PR addresses the issue by: * Introducing a new MLParameters class that provides access to the same parameters as MLParams, but as a regular class instead of a case class. * Replacing usages of MLParams with MLParameters * Storing the members of MLParameters in BenchmarkResult.parameters for logging/persistence. Tested by running default performance tests in src/main/scala/configs/mllib-small.yaml.	2017-08-28 13:23:59 -07:00
Siddharth Murching	d0de5ae8aa	Update tests to run with Spark 2.2, add NaiveBayes & Bucketizer ML tests (#110) * Made small updates in Benchmark.scala and Query.scala for Spark 2.2 * Added tests for NaiveBayesModel and Bucketizer * Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow for the benchmarking of Estimators and Transformers instead of just Estimators Commits: * Changes made so that spark-sql-perf compiles with Spark 2.2 * Updates for running ML tests from the command line + added Naive Bayes test * Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in BenchmarkAlgorithm to allow for testing of transformers in addition to estimators. * Add comment for main method in MLlib.scala * Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param * Add UnaryTransformer trait for common data/methods to be shared across all objects testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc) * Respond to review comments: * bin/run-ml: Add newline at EOF * Query.scala: organized imports * MLlib.scala: organized imports, fixed SparkContext initialization * NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel() * Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver * Fix bug in Bucketizer.scala * Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml * Update Query.scala to use p() to access SparkPlans under a given SparkPlan * Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR	2017-08-21 15:07:46 -07:00
Yin Huai	b3a6ed79b3	Start the development 0.5.0-SNAPSHOT	2017-08-21 14:21:19 -07:00
Bogdan Raducanu	4e7a2363b9	Support for TPC-H benchmark Refactored TPC-DS code to be able to reuse it for TPC-H. Added TPC-H query texts adapted for Spark.	2017-08-09 12:26:32 +02:00
Kevin	fdcde7595c	Update README (#107) Little update for the README	2017-07-13 10:45:24 +02:00
Juliusz Sompolski	6488d74d23	tpcds_2_4: Add alias names to subqueries in FROM clause. ## What changes were proposed in this pull request? Since SPARK-20690 and SPARK-20916 Spark requires all subqueries in FROM clause to have an alias name. ## How was this patch tested? Tested on SF1.	2017-06-29 16:04:08 +02:00
Juliusz Sompolski	bff6b34f62	Tweaks and improvements (#106) Data generation: * Add an option to change Dates to Strings, and specify it in Tables object creator. * Add discovering partitions to createExternalTables * Add analyzeTables function that gathers statistics. Benchmark execution: * Perform collect() on DataFrame, so that it is recorded by SQL SparkUI.	2017-06-13 11:42:14 +02:00
Juliusz Sompolski	75f3876e59	Merge pull request #103 from juliuszsompolski/fixtypes Correct types of keys in schema	2017-05-26 11:53:19 +02:00
Juliusz Sompolski	2ddd521ab5	ok, make it long only where really needed.	2017-05-26 10:36:40 +02:00
Juliusz Sompolski	1bca964a3d	Correct types of keys	2017-05-25 17:12:47 +02:00
Volodymyr Lyubinets	beec62844d	Merge pull request #101 from vlyubin/master Add tpcds 2.4 queries	2017-05-16 10:35:35 +02:00
vlyubin	c0bd21c2ec	Add ss_max	2017-05-16 10:29:00 +02:00
vlyubin	e5dc6f338f	Updated queries 23	2017-05-15 17:30:20 +02:00
vlyubin	e8f85b0b0e	Moved queries into a separate folder	2017-05-15 14:22:37 +02:00
vlyubin	96bf10bffc	Add tpcds 2.4 queries	2017-05-12 11:54:32 +02:00
Eric Liang	c12b14b013	Merge pull request #98 from databricks/parallel-runs Add option to avoid cleaning after each run, to enable parallel runs	2017-03-15 13:50:41 -07:00
Eric Liang	64728c7cff	Add option to avoid cleaning after each run, to enable parallel runs	2017-03-14 19:45:27 -07:00
Timothy Hunter	53091a1935	Removes labels from tree data generation (#82) * changes * removes labels * reset scala version * adding metadata * bumping spark release	2016-12-13 16:47:31 -08:00
srinathshankar	685c50d9dc	Cross build with Scala 2.11 (#91) * Cross build with Scala 2.11 * Update snapshot version	2016-10-03 17:01:17 -07:00
srinathshankar	0eaa4b1d57	[SC-4409] Correct query 41 in TPCDS kit (#90)	2016-09-30 18:02:39 -07:00
Josh Rosen	c2224f37e5	Depend on non-snapshot Spark now that 2.0.0 is released Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available). Fixes #84. Author: Josh Rosen <joshrosen@databricks.com> Closes #85 from JoshRosen/fix-spark-dep.	2016-08-17 17:53:30 -07:00
Timothy Hunter	948c8369e7	Fixes issues with scala 2.11 Updates the usual scala-logging issues to make the source code cross-compilable between scala 2.10 and scala 2.11. Tests: A scala 2.11 version of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11) A scala 2.10 version has been run against the official Spark 1.6.2 release Author: Timothy Hunter <timhunter@databricks.com> Closes #81 from thunterdb/1607-scala211.	2016-07-19 11:19:52 -07:00
Timothy Hunter	8830bffd46	Merge pull request #79 from jkbradley/tree-test-fix Fixed tree, forest, GBT tests by adding metadata to DataFrames	2016-07-11 10:42:19 -07:00
Joseph K. Bradley	51469a34d6	Fixed tree, forest, GBT tests by adding metadata to DataFrames	2016-07-11 10:33:19 -07:00
Timothy Hunter	1fcc366cec	Merge pull request #78 from thunterdb/1607-fixes Adding parameters in case of failures	2016-07-06 11:34:05 -07:00
Timothy Hunter	c7d42d3626	adding parameters	2016-07-06 11:23:07 -07:00
Timothy Hunter	2672bcd5b7	ALS algorithm for spark-sql-perf This has been tested locally with a small amount of data. I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before. Author: Timothy Hunter <timhunter@databricks.com> Closes #76 from thunterdb/1607-als.	2016-07-05 15:54:08 -07:00
Timothy Hunter	93c0407bbe	Merge pull request #77 from thunterdb/1607-linear Linear regression	2016-07-05 15:41:35 -07:00
Timothy Hunter	40e97ca3c0	comment	2016-07-05 15:01:50 -07:00
Timothy Hunter	ce7e20ae6d	set the solver	2016-07-05 13:46:19 -07:00
Timothy Hunter	def20479a1	linear regression	2016-07-05 13:42:56 -07:00
Timothy Hunter	979ebd5d0f	Merge pull request #75 from jkbradley/kmeans Added kmeans test	2016-07-05 10:14:11 -07:00
Joseph K. Bradley	9d11a601c3	added kmeans test	2016-07-01 18:00:49 -07:00
jkbradley	3d3443791c	Merge pull request #74 from jkbradley/dt-tests Decision tree, random forest, GBT classification perf tests	2016-07-01 17:40:16 -07:00
Joseph K. Bradley	495e2716c4	updated per code review. works in local tests	2016-07-01 17:39:28 -07:00
jkbradley	c2f0a35db4	Merge pull request #1 from thunterdb/1606-trees adding experiments to the yaml file	2016-07-01 11:46:41 -07:00
Timothy Hunter	813bd8ad59	adding more experiments	2016-07-01 10:34:42 -07:00
Joseph K. Bradley	c15d083fe7	cleanups	2016-06-30 10:45:15 -07:00
Joseph K. Bradley	ecf2eedbb8	Added decision tree, forest, GBT tests	2016-06-30 10:38:24 -07:00
Joseph K. Bradley	33a1e55366	partly done adding decision tree tests	2016-06-29 17:06:27 -07:00
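
The itemset-generation steps described in commit f08bf31d18 can be illustrated roughly as follows. This is a minimal Python sketch, not the repository's Scala `ItemSetGenerator`: the function names, the `avg_size` and `item_mean` parameters, the wrap-around of sampled integers into the range 0 to P-1, and the rule of drawing extra items until two distinct ones exist are all assumptions made for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method.

    Suitable for small lam only (exp(-lam) underflows for very large lam).
    """
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_itemsets(num_examples, num_items, avg_size=5.0, item_mean=None, seed=0):
    """Generate itemsets over the integers 0..num_items-1 (P = num_items)."""
    if num_items < 2:
        raise ValueError("num_items must be at least 2")
    rng = random.Random(seed)
    if item_mean is None:
        item_mean = num_items / 2  # assumption: centre the item distribution

    rows = []
    for _ in range(num_examples):
        # Step 2.1: choose the itemset size from a Poisson distribution.
        size = poisson(rng, avg_size)

        # Step 2.2: draw `size - 2` items, dropping duplicates.
        # Assumption: draws are wrapped modulo P to stay in range.
        base = []
        for _ in range(max(size - 2, 0)):
            item = poisson(rng, item_mean) % num_items
            if item not in base:
                base.append(item)
        # Assumption: keep drawing until two distinct items exist,
        # because step 2.3.2 needs the first two items.
        while len(base) < 2:
            item = poisson(rng, item_mean) % num_items
            if item not in base:
                base.append(item)

        # Step 2.3.1: derive an item from the first item.
        shifted = (base[0] + num_items // 2) % num_items
        # Step 2.3.2: derive an item from the sum of the first two items.
        summed = (base[0] + base[1]) % num_items

        itemset = list(base)
        for derived in (shifted, summed):
            if derived not in itemset:
                itemset.append(derived)
        rows.append(itemset)
    return rows
```

Because every row contains both derived items, frequent-pattern mining over the output should find association rules from the first item (and from the first pair) to the derived items, which is the stated purpose of step 2.3.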


182 Commits
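
The idea behind commit 737a1bc355 (BlockingLineStream), capping the buffer between a fast producer and a slower consumer so the producer is throttled instead of exhausting memory, can be sketched as below. The actual fix is in Scala and reuses pieces of `ProcessBuilderImpl` and `ProcessImpl` with a `LinkedBlockingQueue`; this is only a Python analogy using the standard `queue.Queue`, and `bounded_line_stream` is a hypothetical name.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def bounded_line_stream(lines, max_queue_size=65536):
    """Yield items from `lines`, read by a background thread through a bounded buffer.

    When the buffer is full, the producer thread blocks in put() until the
    consumer catches up, so at most `max_queue_size` items are held in memory.
    """
    buf = queue.Queue(maxsize=max_queue_size)

    def produce():
        try:
            for line in lines:
                buf.put(line)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal completion, even on error

    threading.Thread(target=produce, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item
```

With an unbounded queue (the analogue of a capacity of Integer.MAX_VALUE), the producer would never block and the buffer could grow until memory runs out; the bound converts that failure mode into back-pressure.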