182 Commits

Author SHA1 Message Date
Nico Poggi
006f096562 Merge pull request #120 from juliuszsompolski/tpcds_notebooks
Add example notebooks for running TPCDS and update readme
2017-09-12 17:22:38 +02:00
Juliusz Sompolski
5ebb9cfb12 add some more comments 2017-09-12 16:51:26 +02:00
Juliusz Sompolski
c78f2b3a9b update readme 2017-09-12 16:40:23 +02:00
Juliusz Sompolski
ae8bcdb292 add notebooks 2017-09-12 15:43:08 +02:00
WeichenXu
f08bf31d18 add benchmark for FPGrowth (#113)
Note:
Add an `ItemSetGenerator` class that uses the following algorithm:

1. Create P = `numItems` items (integers 0 to P-1)
2. Generate `numExample` rows, where each row (an itemset) is selected as follows:
  2.1 Choose the size of the itemset from a Poisson distribution
  2.2 Generate `size - 2` items by choosing integers from a Poisson distribution. Eliminate duplicates as needed.
  2.3 Add 2 new items in order to create actual association rules.
    2.3.1 For each itemset, pick the first item, and compute a new item = (firstItem + P / 2) % P, add new item to the set.
    2.3.2 For each itemset, pick the first 2 items (integers) and add them together (modulo P) to compute a new item to add to the set.
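
The steps above can be sketched roughly as follows. This is a minimal illustration, not the `ItemSetGenerator` from the PR: the object name, the `avgSize` and `avgItem` parameters (the Poisson means are not given in the commit), the minimum size of 3, and the reduction of drawn items modulo P are all assumptions made here.

```scala
import scala.util.Random

object ItemSetSketch {
  // Knuth's method for sampling a Poisson-distributed integer with the given mean.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Builds one itemset following steps 2.1 to 2.3.
  // `avgSize` and `avgItem` are hypothetical names for the Poisson means.
  def itemSet(numItems: Int, avgSize: Double, avgItem: Double, rng: Random): Seq[Int] = {
    // 2.1: choose the size; forced to at least 3 (assumption) so one base item always exists.
    val size = math.max(3, poisson(avgSize, rng))
    // 2.2: draw `size - 2` items, keep them in [0, P) (assumption), drop duplicates.
    val base = Seq.fill(size - 2)(poisson(avgItem, rng) % numItems).distinct
    // 2.3.1: derive an item from the first item.
    val halfShift = (base.head + numItems / 2) % numItems
    // 2.3.2: derive an item from the first two items; with a single base item
    // this degenerates to that item itself (assumption).
    val pairSum = base.take(2).sum % numItems
    (base ++ Seq(halfShift, pairSum)).distinct
  }

  // Generates `numExamples` itemsets from a seeded generator.
  def generate(numExamples: Int, numItems: Int, avgSize: Double,
               avgItem: Double, seed: Long): Seq[Seq[Int]] = {
    val rng = new Random(seed)
    Seq.fill(numExamples)(itemSet(numItems, avgSize, avgItem, rng))
  }
}
```

Because the two derived items depend only on the leading items of each row, frequent-pattern mining over the generated rows should be able to recover rules such as "first item implies (first item + P / 2) % P".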
2017-09-04 10:48:05 -07:00
Juliusz Sompolski
bcda8fc1e5 Coalesce non-partitioned tables. (#118)
In #109 coalescing of non-partitioned tables into 1 file seems to have gotten accidentally removed.
Put it back, but only when clusterByPartitionedColumns == true
Considering that we coalesce partitions only when that setting is true, it seems to be consistent to use it also for non-partitioned tables.

It may be better to change the name of the parameter, but that changes the interface, and possibly should be left for some future clean up.
2017-09-04 18:05:42 +02:00
Siddharth Murching
3e1bbd00ed [ML-2847] Add new tests for (DecisionTree, RandomForest)Regression, GMM, HashingTF (#116)
This PR follows up on #112, adding new performance tests for DecisionTreeRegression, RandomForestRegression, GMM, and HashingTF.

Summary of changes:
* Added new performance tests
* Updated configs in mllib-small.yaml
** Alphabetized configs
** Added new configs for: RandomForestRegression, DecisionTreeRegression, GMM, HashingTF
* Refactored TreeOrForestClassification into a trait (TreeOrForestEstimator) exposing methods for all tree/forest estimator performance tests.
** Copied code from DecisionTreeClassification.scala into TreeOrForestEstimator.scala

I tested this PR by running the performance tests specified in mllib-small.yaml
2017-09-03 22:26:20 -07:00
WeichenXu
19c41464c7 fix df.drop in VectorAssembler (#117)
fix df.drop in VectorAssembler to return correct DataFrame
2017-09-01 13:51:05 -07:00
WeichenXu
6ec83fd0f7 Add benchmark for LinearSVC/OneHotEncoder/VectorSlicer/VectorAssembler/StringIndexer/Tokenizer (#112)
Add benchmark for:

LinearSVC
OneHotEncoder
VectorSlicer
VectorAssembler
StringIndexer
Tokenizer
2017-08-31 13:56:43 -07:00
Juliusz Sompolski
737a1bc355 BlockingLineStream (#115)
## What changes are proposed in this pull request?

Investigating OOMs during TPCDS data generation:
it turned out that the Scala standard library's ProcessBuilder.lineStream by default creates a LinkedBlockingQueue buffer of Integer.MAX_VALUE capacity.
It surfaced after we implemented 10x improvements to dsdgen speed in https://github.com/databricks/tpcds-kit/pull/2.
Now spark-sql-perf does not keep up with ingesting data from dsdgen, and the effectively unbounded buffer causes OOMs.

Pulled out pieces of ProcessBuilderImpl and ProcessImpl just to create a LinkedBlockingQueue with maxQueueSize=65536 instead.
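
The effect of a bounded queue can be illustrated in isolation. The sketch below is not the BlockingLineStream code from this PR; `BoundedLineBuffer`, `pump`, and the end-of-stream sentinel are hypothetical names used only to show how `LinkedBlockingQueue.put` throttles a fast producer once `maxQueueSize` lines are waiting.

```scala
import java.util.concurrent.LinkedBlockingQueue

object BoundedLineBuffer {
  // Sentinel marking end of input; a hypothetical convention for this sketch.
  // A real input line equal to this string would end the stream early.
  private val EndOfStream = "\u0000EOF"

  // Feeds `lines` through a bounded queue from a producer thread and returns
  // everything the consumer received, in order.
  def pump(lines: Iterator[String], maxQueueSize: Int): Vector[String] = {
    val queue = new LinkedBlockingQueue[String](maxQueueSize)

    val producer = new Thread(new Runnable {
      def run(): Unit = {
        // put() blocks while the queue is full, so at most `maxQueueSize`
        // lines are buffered regardless of how fast the producer is.
        lines.foreach(line => queue.put(line))
        queue.put(EndOfStream)
      }
    })
    producer.start()

    val received = Vector.newBuilder[String]
    var line = queue.take() // take() blocks while the queue is empty
    while (line != EndOfStream) {
      received += line
      line = queue.take()
    }
    producer.join()
    received.result()
  }
}
```

With a capacity of `Integer.MAX_VALUE`, `put` would practically never block and the buffer could grow until the JVM runs out of memory; with a small capacity the producer simply waits for the consumer.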

Also submitted https://github.com/scala/scala/pull/6052

## How was this patch tested?

- SSHed into the worker: dsdgen is now being throttled and Java memory no longer explodes.
- tested that TPCDS SF100 generated correctly.
2017-08-31 15:16:22 +02:00
Siddharth Murching
9febc34f66 Refactor MLParams for spark-sql-perf (#114)
A case class (MLParams) is currently used to store/access parameters for ML tests in spark-sql-perf. With the addition of new ML tests to spark-sql-perf (in this PR: #112), the number of ML-related test params will be > 22, but Scala only allows up to 22 params in a case class.

This PR addresses the issue by:
* Introducing a new MLParameters class (class MLParameters) that provides access to the same parameters as MLParams, except as a class instead of a case class.
* Replacing usages of MLParams with MLParameters
* Storing the members of MLParameters in BenchmarkResult.parameters for logging/persistence.
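
The general shape of such a change might look like the sketch below. It is an assumption-laden illustration rather than the actual `MLParameters`: the class name `MLParametersSketch`, the four fields, and the `toMap` representation are invented here, and the real class has many more parameters.

```scala
// A regular class is not subject to the case-class field limit described above;
// every parameter is optional and defaults to None.
class MLParametersSketch(
    val numFeatures: Option[Int] = None,
    val numExamples: Option[Long] = None,
    val regParam: Option[Double] = None,
    val maxIter: Option[Int] = None) {

  // Flattens the parameters that are set into name -> string pairs,
  // e.g. for logging or persisting alongside a benchmark result.
  def toMap: Map[String, String] = Seq[(String, Option[Any])](
    "numFeatures" -> numFeatures,
    "numExamples" -> numExamples,
    "regParam" -> regParam,
    "maxIter" -> maxIter
  ).collect { case (name, Some(value)) => name -> value.toString }.toMap

  // Hand-written stand-in for the `copy` method a case class would generate.
  def copy(
      numFeatures: Option[Int] = this.numFeatures,
      numExamples: Option[Long] = this.numExamples,
      regParam: Option[Double] = this.regParam,
      maxIter: Option[Int] = this.maxIter): MLParametersSketch =
    new MLParametersSketch(numFeatures, numExamples, regParam, maxIter)
}
```

The trade-off is that conveniences a case class provides for free (`copy`, structural equality, pattern matching) have to be written by hand where they are needed.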

Tested by running default performance tests in src/main/scala/configs/mllib-small.yaml.
2017-08-28 13:23:59 -07:00
Siddharth Murching
d0de5ae8aa Update tests to run with Spark 2.2, add NaiveBayes & Bucketizer ML tests (#110)
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow for the benchmarking of Estimators and Transformers instead of just Estimators

Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2

* Updates for running ML tests from the command line + added Naive Bayes test

* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.

* Add comment for main method in MLlib.scala

* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param

* Add UnaryTransformer trait for common data/methods to be shared across all objects testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)

* Respond to review comments:

* bin/run-ml: Add newline at EOF
* Query.scala: organized imports
* MLlib.scala: organized imports, fixed SparkContext initialization
* NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
* Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver

* Fix bug in Bucketizer.scala

* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml

* Update Query.scala to use p() to access SparkPlans under a given SparkPlan

* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
2017-08-21 15:07:46 -07:00
Yin Huai
b3a6ed79b3 Start the development 0.5.0-SNAPSHOT 2017-08-21 14:21:19 -07:00
Bogdan Raducanu
4e7a2363b9 Support for TPC-H benchmark
Refactored TPC-DS code to be able to reuse it for TPC-H.
Added TPC-H query texts adapted for Spark.
2017-08-09 12:26:32 +02:00
Kevin
fdcde7595c Update README (#107)
Little update for the README
2017-07-13 10:45:24 +02:00
Juliusz Sompolski
6488d74d23 tpcds_2_4: Add alias names to subqueries in FROM clause.
## What changes were proposed in this pull request?

Since SPARK-20690 and SPARK-20916, Spark requires every subquery in a FROM clause to have an alias name.
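
As an illustration of the kind of change involved, the sketch below holds the same query text with and without an alias on the derived table. The query is made up for this example and is not taken from the tpcds_2_4 query set; `store_sales`, `ss_item_sk`, and `ss_net_paid` are TPC-DS names, while `item_totals` and the object name are hypothetical.

```scala
object SubqueryAliasSketch {
  // Subquery used as a derived table in the FROM clause.
  private val inner =
    "SELECT ss_item_sk, SUM(ss_net_paid) AS paid FROM store_sales GROUP BY ss_item_sk"

  // Form without an alias on the subquery, which the commit says Spark no longer accepts.
  val withoutAlias: String = s"SELECT ss_item_sk, paid FROM ($inner)"

  // Same query with the derived table given an alias name.
  val withAlias: String = s"SELECT ss_item_sk, paid FROM ($inner) item_totals"
}
```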

## How was this patch tested?

Tested on SF1.
2017-06-29 16:04:08 +02:00
Juliusz Sompolski
bff6b34f62 Tweaks and improvements (#106)
Data generation:
* Add an option to change Dates to Strings, and specify it in Tables object creator.
* Add discovering partitions to createExternalTables
* Add analyzeTables function that gathers statistics.

Benchmark execution:
* Perform collect() on the DataFrame, so that it is recorded by the SQL Spark UI.
2017-06-13 11:42:14 +02:00
Juliusz Sompolski
75f3876e59 Merge pull request #103 from juliuszsompolski/fixtypes
Correct types of keys in schema
2017-05-26 11:53:19 +02:00
Juliusz Sompolski
2ddd521ab5 ok, make it long only where really needed. 2017-05-26 10:36:40 +02:00
Juliusz Sompolski
1bca964a3d Correct types of keys 2017-05-25 17:12:47 +02:00
Volodymyr Lyubinets
beec62844d Merge pull request #101 from vlyubin/master
Add tpcds 2.4 queries
2017-05-16 10:35:35 +02:00
vlyubin
c0bd21c2ec Add ss_max 2017-05-16 10:29:00 +02:00
vlyubin
e5dc6f338f Updated queries 23 2017-05-15 17:30:20 +02:00
vlyubin
e8f85b0b0e Moved queries into a separate folder 2017-05-15 14:22:37 +02:00
vlyubin
96bf10bffc Add tpcds 2.4 queries 2017-05-12 11:54:32 +02:00
Eric Liang
c12b14b013 Merge pull request #98 from databricks/parallel-runs
Add option to avoid cleaning after each run, to enable parallel runs
2017-03-15 13:50:41 -07:00
Eric Liang
64728c7cff Add option to avoid cleaning after each run, to enable parallel runs 2017-03-14 19:45:27 -07:00
Timothy Hunter
53091a1935 Removes labels from tree data generation (#82)
* changes

* removes labels

* reset scala version

* adding metadata

* bumping spark release
2016-12-13 16:47:31 -08:00
srinathshankar
685c50d9dc Cross build with Scala 2.11 (#91)
* Cross build with Scala 2.11

* Update snapshot version
2016-10-03 17:01:17 -07:00
srinathshankar
0eaa4b1d57 [SC-4409] Correct query 41 in TPCDS kit (#90) 2016-09-30 18:02:39 -07:00
Josh Rosen
c2224f37e5 Depend on non-snapshot Spark now that 2.0.0 is released
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).

Fixes #84.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #85 from JoshRosen/fix-spark-dep.
2016-08-17 17:53:30 -07:00
Timothy Hunter
948c8369e7 Fixes issues with scala 2.11
Fixes the usual scala-logging issues to make the source code cross-compilable between Scala 2.10 and Scala 2.11.

Tests:
A scala 2.11 version of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11)
A scala 2.10 version has been run against the official Spark 1.6.2 release

Author: Timothy Hunter <timhunter@databricks.com>

Closes #81 from thunterdb/1607-scala211.
2016-07-19 11:19:52 -07:00
Timothy Hunter
8830bffd46 Merge pull request #79 from jkbradley/tree-test-fix
Fixed tree, forest, GBT tests by adding metadata to DataFrames
2016-07-11 10:42:19 -07:00
Joseph K. Bradley
51469a34d6 Fixed tree, forest, GBT tests by adding metadata to DataFrames 2016-07-11 10:33:19 -07:00
Timothy Hunter
1fcc366cec Merge pull request #78 from thunterdb/1607-fixes
Adding parameters in case of failures
2016-07-06 11:34:05 -07:00
Timothy Hunter
c7d42d3626 adding parameters 2016-07-06 11:23:07 -07:00
Timothy Hunter
2672bcd5b7 ALS algorithm for spark-sql-perf
This has been tested locally with a small amount of data.

I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #76 from thunterdb/1607-als.
2016-07-05 15:54:08 -07:00
Timothy Hunter
93c0407bbe Merge pull request #77 from thunterdb/1607-linear
Linear regression
2016-07-05 15:41:35 -07:00
Timothy Hunter
40e97ca3c0 comment 2016-07-05 15:01:50 -07:00
Timothy Hunter
ce7e20ae6d set the solver 2016-07-05 13:46:19 -07:00
Timothy Hunter
def20479a1 linear regression 2016-07-05 13:42:56 -07:00
Timothy Hunter
979ebd5d0f Merge pull request #75 from jkbradley/kmeans
Added kmeans test
2016-07-05 10:14:11 -07:00
Joseph K. Bradley
9d11a601c3 added kmeans test 2016-07-01 18:00:49 -07:00
jkbradley
3d3443791c Merge pull request #74 from jkbradley/dt-tests
Decision tree, random forest, GBT classification perf tests
2016-07-01 17:40:16 -07:00
Joseph K. Bradley
495e2716c4 updated per code review. works in local tests 2016-07-01 17:39:28 -07:00
jkbradley
c2f0a35db4 Merge pull request #1 from thunterdb/1606-trees
adding experiments to the yaml file
2016-07-01 11:46:41 -07:00
Timothy Hunter
813bd8ad59 adding more experiments 2016-07-01 10:34:42 -07:00
Joseph K. Bradley
c15d083fe7 cleanups 2016-06-30 10:45:15 -07:00
Joseph K. Bradley
ecf2eedbb8 Added decision tree, forest, GBT tests 2016-06-30 10:38:24 -07:00
Joseph K. Bradley
33a1e55366 partly done adding decision tree tests 2016-06-29 17:06:27 -07:00